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ABSTRACT 


Azad, Ariful Ph.D., Purdue University, May 2014. An Algorithmic Pipeline for An¬ 
alyzing Multi-parametric Flow Cytometry Data. Major Professor: Alex Pothen. 

Flow cytometry (FC) is a single-cell profiling platform for measuring the phe¬ 
notypes (protein expressions) of individual cells from millions of cells in biological 
samples. In the last several years, FC has begun to employ high-throughput technolo¬ 
gies, and to generate high-dimensional data, and hence algorithms for analyzing the 
data represent a bottleneck. This dissertation addresses several computational chal¬ 
lenges arising in modern cytometry while mining information from high-dimensional 
and high-content biological data. A collection of combinatorial and statistical algo¬ 
rithms for locating, matching, prototyping, and classifying cellular populations from 
multi-parametric flow cytometry data is developed. 

The algorithms developed in this dissertation are assembled into a data analy¬ 
sis pipeline called flowMatch. This pipeline consists of five well-defined algorithmic 
modules for (1) transforming data to stabilize within-population variance, (2) identi¬ 
fying phenotypic cell populations by robust clustering algorithms, (3) registering cell 
populations across samples, (4) encapsulating a class of samples with templates, and 
(5) classifying samples based on their similarity with the templates. Each module 
of flowMatch is designed to perform a specihc task independent of other modules of 
the pipeline. However, they can also be employed sequentially in the order described 
above to perform the complete data analysis. 

The flowMatch pipeline is made available as an open-source R package in Bio¬ 
conductor (http://www.bioconductor.org/). I have employed flowMatch for classify¬ 
ing leukemia samples, evaluating the phosphorylation effects on T cells, classifying 
healthy immune prohles, comparing the impact of two treatments for Multiple Scle- 
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rosis, and classifying the vaccination status of HIV patients. In these analyses, the 
pipeline is able to reach biologically meaningful conclusions quickly and efficiently 
with the automated algorithms. The algorithms included in flowMatch can also be 
applied to problems outside of flow cytometry such as in microarray data analysis 
and image recognition. Therefore, this dissertation contributes to the solution of 
fundamental problems in computational cytometry and related domains. 
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1 INTRODUCTION 


1.1 Diversity of the cellular systems 


The immune system of a living organism consists of numerous components per¬ 
forming in a coordinated manner in order to provide protection against diseases as 
well as abnormally behaving host cells. The immune system must be able to detect 
a wide variety of virulent agents, distinguish normal host cells from dis-regulated 
pre-cancerous cells, as well as learn and store information about various external 
perturbants [^. In order to decipher the operation and the regulatory networks of 
the immune system, it is crucial to obtain the quantitative description of its cellular 
components. 

Different subsets of immune cells can be distinguished on a basis of their pheno¬ 
types, which in turn determine their functions and roles. The phenotypes of each 
cell are typically dehned by a combination of morphological features such as size, 
shape, granularity etc. and the abundances of surface and intracellular markers such 
as the cluster of differentiation (CD) proteins. The immune cells can be organized 
in a hierarchy where cells performing general functions are positioned at the top, 
while cells performing specihc functions are placed at the bottom of the hierarchy. 


I display a simplihed diagram of the hierarchy of immune cells in Figure 1.1, where 
morphological features are highlighted in the left subhgure and CD protein markers 
expressed by common sub-types of white blood cells (leukocytes) are shown in the 
right subhgure. 

An established way to classify different phenotypes relies on the presence of specihc 
proteins on the surface of cells. The surface molecules are assigned a CD (cluster of 
diherentiation) number related to the type of specihc monoclonal antibodies (mAb) 
that are shown to bind to that epitope. For example, the mature helper T cells (also 
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Figure 1.1. Simplified diagrams showing the cellular hierarchy of human 
immune cells, (a) Heterogeneous immune cells develop from haematopoi¬ 
etic stem cell (HSC). Different types of cells have distinctive morphological 
features such as size, shape, granularity etc. (b) Cluster of differentiation 
(CD) protein markers are shown for the common sub-types of white blood 
cells (leukocytes). (Images are modihed from Wikimedia Commons under 
Creative Commons license.) 


known as CD4+ T cells) express CD45, CDS and CD4 proteins. They are called 
CD45’''CD3’''CD4’'' cells in a common notation. Here, ‘-1-’ and ‘high’ indicate higher 
abundances of a marker, and ‘ —’ and ‘low’ indicate lower levels of it. Identifying the 
CD4+ cell subset is pivotal in AIDS prognosis since the HIV virus infects this cell type 
and signihcantly reduces the number of functional CDd"*" T cells towards the end of 
an HIV-1 infection [^. Similar to the characterization of T cells, other cell types can 
be identihed, described, and isolated on the basis of their morphology, physiology, 
and CD based immunophenotyping patterns. The phenotypic patterns of cells are 
used to study their roles, interactions with other cell groups, and responses to various 
stimuli in healthy or diseased systems [^|^. For these purposes, flow cytometry (FC), 
a prohling method measuring the phenotypic responses of the immune system at the 
resolution of single cells, is commonly employed. 
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Figure 1.2. (a) A schematic diagram of a flow cytometer showing the 
hydrodynamic focusing of the fluid sheath, laser, optics (in simplihed 
form, omitting focusing), photomultiplier tubes (PMTs), analogue-to- 
digital converter, and analysis workstation, (b) At the interrogation point, 
the size, granularity and expression of markers for a cell are measured by 
using forward scatter, side scatter and fluorescence intensities respectively. 
(Images are modihed from Wikimedia Commons under Creative Commons 
license.) 


1.2 Flow cytometry (FC) 

Flow cytometry (FC) is a platform for measuring various features of individual 
cells from millions of cells in a sample. The cellular features measured by FC include 
several morphological features such as size, shape, granularity etc. and the abundance 
of a number of proteins expressed by the cell. In typical FC experiments, cells in a 
suspension (or other biological particles) flow in a stream of fluid passing the region 
where a laser beam interrogates each cell. The light scattered by a cell in the forward 
and perpendicular directions - known as Forward Scatter (FS) and Side Scatter (SS) 
channels respectively - reveal the size, shape and granularity of the cell. The identity 
and abundance of the proteins are measured by the fluorescence signals emitted by 
the laser-excited, fluorophore-conjugated antibodies bound to the target proteins in a 
cell . I show a schematic diagram of a flow cytometer and how it measures different 
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features of a cell in Figure m In a flow cytometer, the mixture of fluorescence 
signals are roughly separated by several band-pass hlters and each signal is collected 
into a separate fluorescence (FL) channel. In this context, I often use the terms 
“a fluorescence channel” and “a protein marker” interchangeably because we infer 
information about the latter through the signal collected at the former. 

Current fluorescence-based technology supports the measurements of up to 20 
proteins simultaneously in each cell from a sample containing millions of cells [^. 
Although emerging atomic mass cytometry systems such as CyTOF can measure 
more than 40 markers per cell, fluorescence detection is still the most used tool for 
single-cell measurements. FC is employed to study the complexity of the immune 
system and the changes in its components upon exposure to various external per- 
turbants such as pathogens, chemical compounds (drugs), or vaccination, as well as 
events such as aging, autoimmune reactions, or presence of cancer. FC is now rou¬ 
tinely employed to illustrate immune cells development and maturation [^, to study 
immune responses in the presence of a pathogen, to diagnose diseases of the immune 
system j^, and to develop novel vaccines (e.g., against HIV) [To]. 


1.3 Flow cytometry data analysis 

An FC sample measuring p features (known as parameters in FC) from n cells 
is represented by an n x p matrix A. The matrix element A{i,j) represents the 
measurements of the feature in the cell. To this end, a cell is represented 
by a p-dimensional feature vector capturing the morphology and protein expression 
prohle of the cell, which are effectively measured by the light scatter and fluorescence 
signals. In a typical experiment, we measure hundreds of samples with each sample 
measuring multi-dimensional features for up to millions of cells. I show a schematic 


view of a data set generated in a typical FC experiment in Figure 1.3 However, the 


actual numbers of features, cells, and samples vary from one experiment to another. 
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Figure 1.3. A schematic view of flow cytometry samples from a typical 
experiment. 


Analysis of a collection of FC samples can lead to diagnosis of various diseases, 
monitoring of immune responses in the presence of a pathogen, development of novel 
vaccines etc. However, before making useful biological conclusions, the raw FC mea¬ 
surements mixed with various sources of noise need to pass through several analysis 
steps in a systematic order. These analysis steps are often guided by specihc aspects 
of FC data and are customized to address the biological needs. I present a list of 
attributes arising in a typical multi-parametric FC experiment and relate each at¬ 


tribute to a necessary data analysis step in Table 1.1, This list is not exhaustive and 
is subject to change depending on the design of experiment and the type of biological 
questions answered by a particular experiment. I briefly discuss several analysis steps 
in the next few paragraphs. 


1.3.1 Removing unintended cells 


In the preprocessing phase, various unintended events such as doublets, dead cells. 


debris, etc. are removed from the FC data. Figure 1.4 shows several preprocessing 


and quality control steps used in a typical FC data analysis. A “doublet” is a pair of 
attached cells, which has larger area but smaller height in the forward scatter (FS) 
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Table 1.1 

Different analysis steps of FC data (right column) are originated from FC 
data attributes (left column). 


FC data attributes 

Data analysis steps 

debris, dead cells, doublets 

preprocessing & quality control 

multiple fluorescence channels (2-20) 

spectral unmixing (compensation) 

variance increases with mean 

variance stabilization 

multiple cellular features (4-20) 

multidimensional distribution 

many observations per sample (10^-10^) 

down-sampling 

several discrete cell subsets (2-50) 

clustering 

similar cell populations across samples 

cluster matching or labeling 

many samples per cohort 

meta-clustering & templates 

multiple classes of samples 

classihcation 


channel relative to the single intact cells. Figure 1.4[ a) shows how we can separate 
single cells (inside the red polygon gate) from the doublets (outside of the red polygon 
gate). Cell viability dyes, e.g., amine reactive viability dyes ViViD and Aqua Blue, 


are often used to separate dead cells from live cells 11 . Live cells are shown with 
a red polygon gate in Fig. |1.4[ b) and dead cells outside of the gate are discarded 
from further analysis. Furthermore, boundary events can also be removed from the 
histogram of total fluorescence as shown in Figure [T~4| (c). Cells outside of the read 
vertical lines are either too dim or too bright in terms of the total fluorescence emission 
and can be removed as outliers. Several other preprocessing steps are occasionally 


performed as part of quality control, for example see the discussion in 12 


1.3.2 Spectral unmixing (compensation) 

Flow cytometry measures the abundance of protein markers in a cell by the flu¬ 
orescence intensities of the fluorophore-conjugated antibodies bound to the target 
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Figure 1.4. Removing unintended events from an FC sample, (a) Single 
intact cells (inside the red polygon gate) are separated from the doublets 
(outside of the red polygon gate), (b) A viability marker (ViViD) is used 
to remove dead cells (outside of the red polygon gate), (c) Cells emitting 
very low or very high fluorescence signals (outside of the read vertical 
lines) are removed as potential outlying events. 


proteins. Because of the overlap of florescence spectra emitted by different fluo- 
rophores, a detector intended for a particular marker also captures partial emissions 
from other fluorophores. The correct signal at each detector is then recovered by a 


process called the Spectral unmixing or compensation 13,14 


To see how a simple spectral unmixing method works, consider an FC system 
measuring the emission of p fluorophores with p detectors. Then the general form of 
the compensation system is given by the following equation: 


o = Ms + a. 


( 1 . 1 ) 


where, 

o = vector of the observed signals at p detectors. 

M = p X p spillover matrix. The off-diagonal element M[i,j] denotes the con¬ 
tribution of the fluorochrome to the detector of the fluorochrome. The 
diagonal elements represent the fraction of signal found in the appropriate chan¬ 
nel. Each column of the matrix adds to unity. 


s = vector of original signal emitted from the p fluorochromes. 
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Figure 1.5. (a) A pair of correlated channels due to the overlap of the 
corresponding fluorochrome spectra, (b) The same pair of channels after 
signals are unmixed from each other. 


a = autofluorescence vector of length p measuring the amount of background 
fluorescence. 

The goal of the spectral unmixing is to calculate the actual signal vector s. The 
simplest and widely used algorithm performing compensation is a straightforward 
application of linear algebra that requires the solution of a linear system of equations 


involving the spillover matrix M: 14,15 


s = M ^ X to 


( 1 . 2 ) 


For example, Fig. 1.5[ a) shows a pair of correlated channels due to the overlap of the 
corresponding fluorochrome spectra. The correlation is removed after the signals are 

illustrated in Fig. |1.5Kb). The accuracy of 


unmixed from each other by Eq. 1.2 


as 


the reconstructed signal, however, depends on the nature of errors generated by the 
fluorescence emission process and photo-electric circuitry of the flow cytometer. The 


error model can be approximated by a mixture of Poisson and Gaussian noise 15,16 
a more accurate compensation scheme is discussed by Novo et al. (Tsl. 


1.3.3 Data transformation and variance stabilization 

After performing initial pre-processing, FC data is often transformed for proper 
visualization and subsequent analysis. Various nonlinear functions are typically 















9 


used for this purpose, such as logarithm, hyperlog, biexponential, asinh transforma¬ 
tions 


17 20 . These transformations are aimed primarily to resolve cell populations 


uniformly and to allow unimpeded visual interpretation especially by using a series 
of 2-D FC scatter plots. However, owing to the nature of photon-counting statistics, 
variance of a fluorescence signal monotonically increases with the average signal inten¬ 
sity (average protein expression). This signal-variance dependence creates problem 
in uniform feature extraction and comparing cell populations with different levels of 
marker expressions. To remove the correlation between signal and variance, variance 
stabilization (VS) is performed. 


Variance stabilization (VS) [^,22 is a process that decouples signal from the 
variance such that the variance become approximately constant along the complete 
range of fluorescence intensities. In FC, VS can be achieved by transforming data 
with properly tuned transformation parameters. Several past works followed this 
approach of parameter optimization, but mostly with an objective to maximize the 
likelihood of individual cell populations being generated by a mixture of multivariate- 


normal distributions on the transformed scale 23,24. To remove the correlation 
between signal and variance, I developed a variance stabilization (VS) method called 
flow VS that decouples signal from the variance by an inverse hyperbolic sine (asinh) 
transformation function whose parameters are estimated by a likelihood-ratio test. I 
will discuss variance stabilization for FC samples in more detail in Chapter 


1.3.4 Identifying cell populations (gating or clustering) 

For a given set of markers, a cell population or cell cluster is a subset of cells in 
a sample with similar physical and fluorescence characteristics, and thus biologically 
similar to other cells within the subset but distinct from those outside the subset. In 
conventional FC practice, trained operators identify cell populations by visualizing 
cells in a collection of two-dimensional scatter plots. This is traditionally a manual 
process known as “gating” in cytometry. However, with the ability to monitor large 
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numbers of cellular features simultaneously, the visualization-based approach is inad¬ 
equate to identify high-dimensional populations. To address this problem, a number 
of researchers have proposed several customized clustering algorithms to identify cell 


populations in FC samples. These algorithms include model-based clustering 25 27 


density based clustering 28,29, spectral clustering 30 and other non-parametric 


approaches 31 . The FlowCAP consortium (http://flowcap.flowsite.org/) designed a 
set of challenges with an aim to identify the best clustering algorithms for different 


FC datasets and Aghaeepour et. ah 32 provides a state-of-the-art summary of the 
held. 

Automated population identihcation is a pivotal step in the FC data analysis. 
Given the availability of a large collection of software packages for solving the cluster¬ 
ing problem, it is not always trivial to select the best option to analyze a particular 
dataset, especially when no “ground truth” or “gold standard” is available. In this 
situation various cluster validation methods can be used to evaluate the quality of 


different clustering solutions from which the best option can be picked 33,34 . How¬ 
ever, it has been shown that an ensemble clustering computed from a consensus of a 
collection of simple clustering solutions often outperforms any individual algorithm 


for FC data 32 . Finding an optimal ensemble clustering is an NP-hard problem and 


different heuristics are often used to obtain an approximate solution 35 . Consider¬ 
ing the importance of the clustering problem, I describe a new consensus clustering 
algorithm that maintains a hierarchy of clustering solutions from different algorithms. 
This algorithm provides a new perspective of ensemble clustering to the FC commu¬ 
nity. I will discuss different clustering algorithms and a novel consensus clustering 
algorithm in Chapter 


1.3.5 Registering cell populations across samples 

Phenotypically distinct cell populations often respond differently upon perturba¬ 
tion or change of biological conditions. This changed response results in increased 
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protein expression in each cell, or in changes in the fraction of cells belonging to a 
cell type. To identify the population-specihc changes, we register cell clusters across 


different samples 27,36 . The population registration problem can be solved by com¬ 
puting a similarity measure between each pair of cell clusters across samples and 
matching clusters with high similarity. Additionally, when biologically meaningful 
labels for clusters are known for a sample, a population matching algorithm can label 
clusters from another sample, thus solving the population labeling problem as well. 

In conventional FC practice, population registration is often performed by map¬ 
ping 2-D projections of clusters. However, the visual mapping is inadequate for high¬ 
dimensional data. Recently two different types of algorithms have been proposed in 
order to solve the population registration problem automatically and efficiently. In 
the hrst approach, the centers of different clusters are “meta-clustered” (cluster of 
clusters) and the clusters whose centers fall into same meta-cluster are marked with 


same label 23 . The second approach computes a biologically meaningful distance 
between each pair of clusters across samples and then matches similar clusters by 
using a combinatorial matching algorithm l27l|^. I developed an algorithm of the 


second type, called the mixed edge cover (MFC) algorithm 36,37. The MEC algo¬ 
rithm uses a robust graph-theoretic framework to match a cluster from a sample to 
zero or more clusters in another sample and thus solves the problem of missing or 
splitting populations as well. I discuss more about the cluster matching algorithms 
in Chapter]^ 


1.3.6 Meta-clustering and templates 

A flow cytometry dataset often consists of samples belonging to a few represen¬ 
tative classes representing multiple experimental conditions, disease or vaccination 
status, time points etc. Towards this end, I assume that samples belonging to a 
particular class are more similar (given a biologically meaningful similarity measure) 
among themselves than samples from other classes. In this scenario it is more effi- 
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cient to summarize a collection of samples with a few representative templates, each 


template representing samples from a particular class 23,27,36 . Thereby, overall 
changes across multiple conditions can be determined by comparing the cleaner and 
fewer class templates rather than the large number of noisy samples themselves. 

A template is usually constructed by matching similar cell clusters across samples 
and combining matched clusters into meta-clusters. Clusters in a meta-cluster rep¬ 
resent the same type of cells and thus have overlapping distribntions in the marker 
space. Therefore, the meta-clusters represent generic cell populations that appear in 
most samples with some sample-specihc variation. A template is a collection of rela¬ 
tively homogeneous meta-clusters commonly shared across samples of a given class, 
thus describing the key immunophenotypes of an overall class of samples in a formal, 
yet robust, manner. 

In my dissertation, I have developed a hierarchical matching-and-merging 
(HM&M) algorithm that builds templates from a collection of samples by repeat¬ 
edly merging the most similar pair of samples or partial templates obtained by the 


algorithm thus far 36 . A meta-cluster within a template represents a homogeneous 
collection of cell populations and acts as a blneprint for a particnlar type of cell. 
However, traditional nnll hypothesis based signihcance testing (e.g., F-test or paired 
t-test) often has a high probability of making a Type I error when used to evaluate 
the homogeneity of a meta-cluster because of large cluster sizes. Hence, to evalnate 
meta-cluster homogeneity, I propose the ratio of between-cluster to within-cluster 
variance (relative cluster separation, 0), in a MANOVA model, as an alternative 
method to evalnate the homogeneity of a meta-cluster. I will discuss meta-clustering 
and template construction algorithm in detail in Chapter 


1.3.7 Classifying FC samples 

Besides their use in high-level visualization and cross-class comparisons, templates 
can be employed to classify new samples with unknown status. Templates work 
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as prototypes of different biological classes, e.g., disease status, time points etc., 
by emphasizing the common properties of the class while omitting sample-specihc 
details. A new sample with unknown class label is predicted to come from a class 
whose template the sample is most similar to. The template-based classihcation is 
robust and efficient because it compares samples to cleaner and fewer class templates 
rather than the large number of noisy samples themselves. While classifying new 
samples, the templates can be dynamically updated to incorporate the information 
gained from the classihcation of the new samples. This approach makes it possible 
to summarize the data from each laboratory using templates for each class, and 
then to merge the templates and template-trees across various laboratories, as the 
data is being continuously collected and classihed. I will discuss the template-based 
classihcation algorithms in detail in Chapter 


1.4 The need for automation in FC data analysis 


FC data is large, continuous, high-dimensional, and perturbed by Poisson and 
Gaussian noise as we measure various features of individual cells for millions of intact 
cells in a sample. Traditionally, after some semi-automated preprocessing, cell pop¬ 
ulations are identihed by visualizing cells as a collection of two-dimensional scatter 
plots. The reader will hnd it helpful to refer the Figure 3.3 b) in Chapter for an 
example where four types of immune cells are identihed using hve cluster of diheren- 
tiation (CD) proteins. A trained operator then visually compares these biaxial plots 
from diherent samples, registers populations across samples from multiple conditions 
(healthy vs. disease for example) and studies the diherences across conditions to 
extract necessary biological information. 

Recent advances in FC technology pose challenges to the traditional manual anal¬ 
ysis in three aspects: (1) high-dimensionality of data, (2) large volume of data, and 
(3) compute-intensive analysis. First, cell populations dehned in higher dimension 
are difficult to locate in 2-D projections. Furthermore, biaxial plots assume that the 
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axes are orthogonal to each other; however, often there are correlations between these 
protein markers, and these are difficult to analyze using biaxial projections. Second, 
it is laborious and often infeasible to manually analyze an FC dataset with hundreds 
of samples, each with millions of multi-parametric cells. Third, several analysis steps 
require numerical calculations such as optimization of nonlinear functions and solu¬ 
tions of matrix equations. However, a collection of tools designed to perform specihc 
functions but not to work together does not improve the overall analysis of FC data, 
because a signihcant amount of time is spent in hnding appropriate tools for each 
step, in identifying optimum parameters for the selected tools and, in processing 
data between different steps. Therefore, to prevent the data analysis from being the 
bottleneck in scientihc discovery based on cytometry, an automated and systematic 
cytomics pipeline is necessary. 

Even though a rich set of computational tools is available for other fluorescence- 


based technologies such as the microarrays 38 , they can not be directly applied - as a 
black box - to analyze FC data because of the technological differences. Microarrays 
measure the expression of a large number of genes under different conditions, whereas 
FC measures a smaller number of proteins characteristic of a few immunophenotypes, 
across a large number of samples. As a result, the objectives of various analysis steps 
are often different for these two technologies. For example, between-samples vari¬ 
ances across a large number of genes are stabilized in microarray, whereas I stabilize 
within-population variances across a collection of samples in FC. Another example 
of their difference is in data clustering, where multi-parametric cells are clustered 
to identify functional cell populations in FC; in contrast, genes are clustered based 
on their expression patterns in microarray. In summary, the nature of the data, its 
pre-processing, statistical treatment, and algorithms for downstream analysis are all 
substantially different for FC and other fluorescence-based technologies. Therefore, 
a customized pipeline adapted to the properties of FC data is necessary, especially 
to keep up with the ever increasing dimensionally and large number of samples gen¬ 
erated routinely in FC experiments. To address this need, recently there has been 
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an influx of automated tools for analyzing FC data [^[^|^|^. The algorithms 
and software developed in my PhD work are the latest additions to the attempt of 
automating FC data analysis. 


1.5 Contributions 

This dissertation has four major contributions to computational cytometry: (a) 
algorithms, (b) software, (c) biological applications, and (d) publications. 

Algorithms: This dissertation addresses the need to analyze the large volume of 


multi-parametric FC data by developing an algorithmic pipeline called flowMatch 41 


The pipeline contains six algorithmic modules performing six steps of FC data anal¬ 
ysis: (1) spectral unmixing to remove the effect of overlapping fluorescence channels, 
(2) stabilizing variance to decouple signals from noise, (3) robust clustering of cells 
to identify phenotypic populations, (4) registering cell clusters across samples from 
multiple conditions, (5) constructing templates by preserving the common expression 
patterns across samples, and (6) classifying samples based on the measured pheno¬ 
types. In addition to these six steps, the pipeline has several helper functions for 
preprocessing and visualizing FC data. 


Figure 1.6 displays the schematic view of six functional modules of the flowMatch 
pipeline. An FC sample is represented by an n x p matrix, where n is the number of 
cells and p is the number of features (physiological properties and expression levels of 
markers) measured in each cell. Subhg. 1.6[ 1) shows the overlap of two spectra (green 
and yellow) emitted by two fluorochromes. The yellow band pass hlter captures a 
signihcant amount of signal from the green spectrum, which must be compensated 


e.g., by Eq. 1.2) to correctly reconstruct the signal form the yellow fluorochrome. 


Subhg. 1.6 2) displays the density plots of a single marker from several samples of 
a dataset after stabilizing the variance. Observe that the variances (width) of the 
density peaks (both positive and negative peaks across samples) are nicely stabilized. 


Subhg. 1.6 3) shows the application of a clustering algorithm to a three dimensional 
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An FC sample 
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Figure 1.6. An FC data analysis pipeline. (1) unmixing of spectra to re¬ 
move the effect of overlapping fluorescence channels, (2) stabilizing vari¬ 
ances by transformation and normalization, (3) identifying cell popula¬ 
tions by automated clustering algorithm in a high-dimensional marker 
space, (4) registering cell populations across samples with a matching al¬ 
gorithm, (5) representing a class of samples with high-level templates, and 
(6) classifying samples by using the templates. 


sample, where different colors denote four phenotypically distinct cell populations. 


In Subhg. 1.6 4), I show how functionally similar cell clusters are matched by a 
population registration algorithm, where same colors are used to mark the matched 


clusters. The next Subhg. 1.6 5) illustrates how a template is created by a hierarchical 
algorithm from six samples belonging to the same class. Finally, the classihcation 
of a new sample based on its similarity with two class templates is described in 


Subhg. 1.6'6). 

Software: I have developed algorithms for the last hve steps of the pipeline except 
the spectral unmixing step. I included spectral unmixing and other related prepro¬ 
cessing tools from related packages to make this pipeline as complete as possible. I 


have developed an R package flowMatch 41,42 by implementing the steps discussed 


in Fig. |1.6 in both R and C-I--I- programming languages. This package has been made 
available through Bioconductor (http://www.bioconductor.org/). I hope that 
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flowMatch will be a useful addition to the available FC data analysis tools and can 
contribute to faster analysis of large FC datasets. 

Biological applications: The flowMatch pipeline contains several combinatorial 
and statistical algorithms used to perform different steps of data analysis. The com¬ 
plete analysis is usually divided into a series of encapsulated sub-problems and each 
of them is solved by a module of the pipeline. I demonstrated the application of 
different steps of the pipeline with three data sets: healthy donor data, T cell phos¬ 
phorylation data, and acute myeloid leukemia (AML) data. The healthy donor (HD) 
dataset consists of 65 hve-dimensional samples from hve healthy individuals who do¬ 
nated bloods on different days. I use this dataset to demonstrate that by stabilizing 
within-cluster variance we are able to construct homogenous meta-clusters despite the 
presence of biological, temporal, and technical variations. The T cell phosphorylation 
(TCP) dataset consists of 29 pairs of samples before and after stimulation with an 
anti-CD3 antibody [^. I analyzed the data in order to demonstrate that templates 
can be created from multiple conditions and the meta-clusters can be matched across 
experimental conditions (before and after stimulation) to assess the overall changes in 
phosphorylation experiment. The acute myeloid leukemia (AML) dataset consists of 
samples from 43 AML patients and 217 healthy individuals. I used flowMatch pipeline 
to build healthy and AML templates, to identify AML markers, and to classify AML 


samples by comparing test samples against the two templates 45 . I describe the 
datasets in Sectioi jL^ in more detail. In the past, I have also tested different com¬ 
ponents of flowMatch to solve problems in evaluating HIV vaccination success, and 
detecting correlations among multiple sclerosis (MS) treatments. 

Publications: The algorithms with their biological applications developed in this 


dissertation have been published to several peer-reviewed journals 32,3m (Nature 


Methods, BMC Bioinformatics) and conferences 37 ^ (e.g., WABI, ACM BCB, 


GLBIO, SIAM LS, Cyto, etc.). Several papers are submitted for publication 41,45 


The flowMatch pipeline was one of the top performers to solve several challenges 
designed by the FlowCAP consortium at National Institute of Healthy (NIH). I have 
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developed several multithreaded algorithms for computing the maximum cardinality 
matching in large graphs, which are published in several conference proceedings (SC, 


IPDPS, IPDPSW) 49 . Aside from this dissertation, I have also developed a 
robust variant of residual resampling technique for computing the uncertainty in 
evolutionary trees and illustrated its use with an analysis of genome-scale alignments 


of yeast 50,51 


Difference from related work: flowMatch is similar to the GenePattern Flow Cy¬ 
tometry Suite in terms of the coverage of distinct algorithms for different analysis 
steps. However, these two pipelines are signihcantly different from each other on how 
each step of the pipeline is performed. For example, GenePattern normalizes FC sam¬ 
ples by aligning density peaks on each channel as described by Hahne et al. [^. In 
contrast, flowMatch stabilizes variances of the density peaks on each channel without 
shifting the signals. Likewise, other steps of the flowMatch pipeline are signihcantly 
different from the corresponding steps in the GenePattern Flow Cytometry Suite. 

The primary difference between howMatch and other FC data analysis tools (dis¬ 


cussed in Section 1.7) is that I consider a collection of FC sample related to each other 
and analyze them collectively. For example, I characterize a group of similar samples 
with representative templates and use templates in between-class comparison, clas- 
sihcation, and other overall biological conclusions. In contrast, most of the existing 
tools analyze samples individually and only make sample-specihc conclusions. Finally, 
howMatch is not an exhaustive pipeline and I plan to include other functionalities 
into the pipeline in future. 


1.6 Datasets used in this thesis 
1.6.1 Healthy donor (HD) dataset 

The healthy donor (HD) dataset represents a “biological simulation” where pe¬ 
ripheral blood mononuclear cells (PBMC) were collected from hve healthy individuals 
on up to four diherent days. Each sample was divided into hve parts and analyzed 
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through a flow cytometer at Purdue’s Bindley Biosciences Center. Thus, we have flve 
technical replicates for each sample from a subject, and each replicate was stained 


using labeled antibodies against CD45, CDS, CD4, CDS, and CD19 protein markers. 


I show a summary of the HD dataset in Fig. 


1.7 


Channel Name 

Function 

FS 

Forward Scatter 

SS 

Side Scatter 

CD45 

Leukocyte marker 

CD3 

T lymphocyte marker 

CD19 

B lymphocyte marker 

CD4 

Helper! cell marker 

CDS 

Cytotoxic T cell marker 


Subjects 

Number of 
days blood 
collected 

Total samples 
per subject 

Total 

samples 

Subject A 

3 

15 

65 

Subject B 

1 

5 

Subject C 

4 

20 

Subject D 

3 

15 

Subject E 

2 

10 


Figure 1.7. Description of the healthy donor (HD) dataset. The distri¬ 
bution of samples across subject is shown in the left and the functions of 
scatter and fluorescence channels are described in the right. 


The HD dataset includes three sources of variations: (1) technical or instrumental 
variation among replicates of the same sample, (2) within-subject temporal (day-to- 
day) variation, and (3) between-subject natural or biological variation. The dataset 
contains 13 replicated groups, with each group containing flve copies of the same 
sample. Hence, the variation among flve replicates within each replicated group orig¬ 
inates from the technical variation in sample preparation, instrument calibration and 
sample measurement in the flow cytometer. The within-subject temporal variation 
reflects the environmental impact on the immune system on different days when the 
blood was drawn from a subject. Finally, samples from different subjects have natural 
between-subject variations. Different sources of variations present in the HD dataset 
mate it an ideal case to demonstrate the functionality of the flowMatch pipeline. 


1.6.2 T cell phosphorylation (TCP) dataset 


I reanalyze a T cell phosphorylation (TCP) data set from Maier et ah 44 to deter¬ 


mine differences in phosphorylation events downstream of T cell receptor activation 
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in naive and memory T cells. For each of the 29 subjects in this study, whole blood 
was stained using labeled antibodies against CD4, CD45RA, SLP-76 (pY128), and 
ZAP-70 (pY292) protein markers before stimulation with an anti-CD3 antibody, and 
another aliquot underwent the same staining procedure hve minutes after stimulation. 
The hrst two markers (CD4, CD45RA) are expressed on the surface of different T 
cell subsets and the last two (SLP-76 and ZAP-70) are highly expressed after T cells 


are phosphorylated 44 


During the stimulation anti-CD3 antibody binds with T cell receptors (TCR) and 
activates the T cells, initiating the adaptive immune response. The binding with 
TCR induces dramatic changes in gene expression and cell morphology, and induces 
the formation of a phosphorylation-dependent signaling network via multi-protein 
complexes. ZAP-70 is a kinase that phosphorylates tyrosine in a trans-membrane 
protein called LAT, and LAT and SLP-76 are part of a platform that assembles 


the signaling proteins 53 . I reanalyzed this dataset in order to demonstrate that 


computationally meta-clusters are preserved across experimental conditions (before 
and after stimulation) when the homogeneity of meta-clusters is preserved. 

The AML dataset is discussed in Chapter 

1.7 Related work 


Recent advances in FC technology have led to an influx of automated tools to 


analyze FC data 27,32,39,40 . However, many of the automated tools solve a small 


subset of the problems arising in FC data analysis. For example, spectral unmix¬ 


ing (compensation) is discussed in 13 -15 and various data transformation methods 
are discussed in lilig [^[^. Automated gating or clustering is arguably the most 


discussed problem in FC data analysis 26,27,31,32 and Aghaeepour et ah 32 pro 


vides a state-of-the-art summary of the held. Cluster matching and meta-clustering 


have been studied only recently by 23,27, but have received relatively less atten¬ 


tion from the researchers. Other general-purpose and problem-specihc methods are 
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also discussed in the literature 54-58 . A number of the aforementioned tools are 


implemented as free, open-source R packages such as flowCore 59 , flowViz |60], flow- 


Clust 61 , flow Trans 23 , flowStats 62 , flowType 54 and other packages available 


through Bioconductor 43 


Recently Spidlen et ah introduced a web-based FC data analysis pipeline called 
“GenePattern Flow Cytometry Suite” (http://www.genepattern.org) that includes 34 
open-source modules performing preprocessing, normalization, gating, cluster match¬ 


ing and other post-processing steps 39 . flowMatch is similar to the GenePattern 
Flow Cytometry Suite in terms of the problem it solves, but signihcantly different on 
how each step is performed. For example, GenePattern normalizes FC samples by 


aligning density peaks on each channel as described by Hahne et ah 52 . In contrast, 
I stabilize variances of the density peaks on each channel without shifting the signals. 
Likewise, other steps of the flowMatch pipeline are significantly different from the 
corresponding steps in the GenePattern Flow Cytometry Suite. 

To the best of my knowledge, there is no other software tool that would pro¬ 
vide a comprehensive pipeline for FC data analysis. However, a few software tools, 
most of them commercial, integrate one or two steps of the complete analysis 
pipeline. For example, the Immunology Database and Analysis Portal (ImmPort, 
https://immport.niaid.nih.gov) integrates the FLOCK software that performs nor¬ 


malization and density based clustering 63 . Cytobank 40 is a web-based applica¬ 
tion focusing mainly on the organization and storage of cytometry data. Finally, most 
of the major commercial third party software vendors, including Tree Star, De Novo 
Software, and Verity Software House, offer computational tools for certain steps of the 
FC data analysis. However, all these tools solve only a few steps of the overall com¬ 
putational analysis with limited support for the other steps. I hope that flowMatch 
will be a useful addition to the available FC data analysis tools and can contribute 
to faster analysis of FC datasets. 
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1.8 Outline of the thesis 

This thesis describes various computational aspects of analyzing flow cytometry 
data. I have developed a systematic pipeline, flowMatch containing hve functionally 
distinct modules for analyzing large volume of multi-parametric FC data. These hve 
algorithmic modules are discussed in hve chapters of this thesis. In Chapter I 
describe flow VS, a method for stabilizing variance in FC data by transforming each 
channel with nonlinear functions. Chapter discusses several basic and two con¬ 
sensus clustering algorithms and explains how multiple cluster validation indices can 
simultaneously be optimized to select the “best” clustering algorithm and algorith¬ 
mic parameters. Chapter discusses a robust mixed edge cover (MFC) algorithm 
for registering (labeling) cell clusters across samples. The next Chapter discusses a 
hierarchical matching-and-merging (HM&M) algorithm that summarizes a collection 
of similar samples with templates consisting of several homogeneous meta-clusters. 
I discuss how the templates are used to classify new samples in a dynamic fashion 
in Chapter Chapter discusses algorithms for classifying and immunophenotyp- 
ing the acute myeloid leukemia (AML). Concluding remarks and future directions of 
research are presented in Chapter 
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2 VARIANCE STABILIZATION IN FLOW CYTOMETRY 


2.1 Introduction 


In this chapter, I discuss flow VS - a novel method for stabilizing within-population 
variances across flow cytometry (EC) samples. flowVS stabilizes variance by trans¬ 
forming EC samples with the inverse hyperbolic sine (asinh) function whose parame¬ 
ters are optimized to homogenize the within-population variances. Variance stabiliza¬ 
tion (VS) is a data-transformation process for dissociating data variability from the 
mean values. In flow cytometry, the purpose of VS is to decouple biological signals 
(usually measured by average marker expressions of cell populations) from different 
sources of variations and noise so that only biologically signihcant effects are detected. 

Variance inhomogeneity is an inherent problem in fluorescence-based EC mea¬ 
surements and is a bottleneck for automated multi-sample comparisons. The origin 
of the problem is the fluorescence signal formation and the detection process that 
monotonically increases the variance of the fluorescence signal with the average sig¬ 


nal intensity 15,16 . Due to such signal-variance dependence, a cell population (a 
cluster of cells with similar marker expressions) with higher level of protein expressions 
(i.e., higher fluorescence emission) has higher variance than another population with 
relatively low level of protein expressions (i.e., low fluorescence emission). This in¬ 
homogeneity of within-population variance creates problems with extracting features 
uniformly and comparing cell populations with different levels of marker expressions. 
Furthermore, detecting statistically signihcant changes among populations, such as in 
an analysis of variance (ANOVA) model, explicitly requires that variance be approx¬ 
imately stabilized in populations. By removing mean-variance dependence from EC 
data, VS makes it possible to detect biologically and statistically meaningful changes 
across populations from different samples. 
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VS is an explicit preprocessing step for other flnorescence-based technologies snch 


as the microarrays 22,64-66 . However, nnlike microarray data, explicit VS is of¬ 
ten not performed in FC data analysis. Traditionally, FC data is transformed with 
nonlinear fnnctions to project cell popnlations with normally distribnted clnsters - a 
choice that nsnally simplihes snbseqnent visnal analysis pT| - [^[2^[2^ . Recently Fi- 
nak et ah nsed the maximnm likelihood approach [2^ with different transformations 


to explicitly satisfy normality of the cell populations. Ray et ah 24 transformed 
each channel with the asinh function whose parameters are optimally selected by the 
Jarque-Bera test of normality (a goodness-of-£t test of whether sample data have the 
skewness and kurtosis matching a normal distribution). While these transformations 
approximately normalize FC data, they might not stabilize the variance. 

The VS problem in FC, however, cannot be solved directly by applying the mature 
VS techniques from the microarray literature. In microarrays, each gene is measured 
multiple times (possibly under multiple conditions) and the between-sample variance 
for each gene is stabilized with respect to the average expression of the gene across 
samples. In contrast, variance is dehned by within-population, cell-to-cell variation 
in FC and this within-population variance is stabilized with respect to the average 
expression of markers within each population. These contrasting objectives prevent 
us from applying VS methods from microarray literature directly to flow cytometry. 

I address the need for explicit VS in FC with a maximum likelihood (ML) based 
method, flow VS, which is built on top of a commonly used asinh transformation. 
The choice of asinh function is motivated by its success as a variance stabilizer for 


microarray data [^,66 . flowVS stabilizes the within-population variances separately 
for each marker (fluorescence channel) 2 ; across a collection of N samples. After 
transforming ^ by asinh(z/c) (c is a normalization cofactor), flowVS identihes one¬ 
dimensional clusters (density peaks) in the transformed channel. Assume that a 
total of m 1-D clusters are identihed from N samples with the cluster having 
variance af. Then the asinh(z/c) transformation works as a variance stabilizer if 


the variances of the 1-D clusters are approximately equal, i.e., erf ~ ~ 


cz 
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To evaluate the homogeneity of variance (also known as homoskedasticity), I use 


Bartlett’s likelihood-ratio test 67 . From a sequence of cofactors used with the asinh 
function, flow VS selects one with the “best” VS quality computed by Bartlett’s test. 
flow VS is therefore an explicit VS method that stabilizes within-cluster variances in 
each marker/channel by evaluating the homoskedasticity of clusters with a Likelihood- 
ratio test. 

I show, with a hve-dimensional healthy dataset, that flow VS removes the mean- 
variance dependence from raw FC data and makes the within-population variance 
relatively homogeneous. Such variance homogeneity is especially useful to build meta¬ 
clusters from a collection of phenotypically similar cell populations across samples. 


Previous studies (Hahne et ah 52 , for example) shifted the distribution of each 
fluorescence channel to ensure homogeneity in meta-clusters, but such artihcial shift¬ 
ing may hide useful biological signals. By contrast, flowVS builds homogeneous 
meta-clusters from variance-stabilized clusters without losing the biological differ¬ 
ences among samples. I will discuss the impact VS on comparisons among samples 
(with related concepts of meta-clusters and templates) in Chapter 

on 


The rest of the chapter is organized as follows. I start with a short Section 2.2 


related work and the current transformation techniques employed in FC. In Section 


2.3 I describe flowVS, a method for stabilizing variance of FC data. The next Section 


2.4 describes applications of this VS technique to an FC dataset and a microarray 


dataset. I conclude this Chapter in Section [23] by discussing limitations of flowVS 
and possible future work. 


2.2 Related work 

From the beginning of the twentieth century, VS has been widely studied for its 
central role in making heteroskedastic data easily tractable by standard methods. 
Heteroskedasticity appears in various datasets mostly because the data follows a 
distribution with correlated mean and variance, e.g., the Poisson distribution. For 
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well-known distribution families, VS is usually performed by transforming data with 
an analytically chosen function /. For example, f{z) = ■\/z + 3/8 works as a good 
(asymptotic) stabilizer for a random variable 2 ; following the Poisson distribution 


[68 

in 


Variance stabilizers for several well-known distribution families are described 


68 1 69 . For unknown distributions, heuristic and data-driven stabilizers are often 


used, e.g., see 21,70,71 


However, traditional transformations are often inadequate for low-count (photon 


limited) signals 22,72 because of unknown error patterns in fluorescence data. Hence 
various ad hoc variance stabilization schemes have been developed for different types 
of fluorescence data. In microarrays, the VS problem has been addressed by vari¬ 


ous non-linear transformations 22,64-66 . Most notably, the widely used approach 


by Huber et ah 22 uses the inverse hyperbolic sine (asinh) transformation whose 
parameters are selected by a maximum-likelihood (ML) estimation. 

For flow cytometry data, researchers have used various non-linear transformations, 
such as the logarithm, hyperlog, generalized Box-Cox, and biexponential (e.g., logicle 


and generalized arcsinh) functions 17-20,23,26 . In the past work, the parameters 
of these transformations were adjusted in a data-driven manner to maximize the 
likelihood {flow Trans by Finak et ah [^), to satisfy the normality {flowScape by Ray 
et ah [^), and to comply with simulations {PCSTrans by Qian et ah [^). flowTrans 
estimates transformation parameters for each sample by maximizing the likelihood of 
data being generated by a multivariate-normal distributions on the transformed scale. 
flowScape optimizes the normalization factor of asinh transformation by the Jarque- 
Bera test of normality. FCSTrans selects the parameters of the linear, logarithm, 
and logicle transformations with an extensive set of simulations. In contrast to these 
approaches that transform a single sample, flow VS transforms a collection of samples 
together for stabilizing within-population variations. Note that normalizing data may 
not necessarily stabilize its variance, e.g., for a Poisson variable z, sjz + ?>/S is an 
approximate variance-stabilizer, whereas is a normalizer 


21 
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2.3 Variance stabilization for flow cytometry data 
2.3.1 The goal of variance stabilization 


The aim of variance stabilization (VS) in FC is to make within-popnlation vari¬ 
ances of different cell popnlations approximately eqnal and thereby independent of 
the average protein expressed by popnlations. Recall that the expression of a protein 
is measnred by the intensity of a channel captnring the flnorescence of a particnlar 
wavelength. VS therefore stabilizes the within-popnlation flnorescence variance and 
makes it independent of the mean flnorescence intensities (MFI) of the cell popnla¬ 
tions. In this context, I nse the terms “a flnorescence channel” and “a protein marker” 
eqnivalently becanse we infer information abont the latter throngh the signal collected 
at the former. However, I will nse “flnorescence channel” more freqnently becanse the 
natnre of flnorescence emissions - not the protein expressions - dominates the mean- 
variance relationship in FC data. I do not stabilize variance on the scatter channels 


becanse, as pointed ont by Finak et ah 23 , there are few benehts to transforming 
forward and side scatter channels. 


2.3.2 Channel-specihc variance stabilization 


I assume that compensated flnorescence channels are independent and stabilize 
variance on each channel (a colnmn of the data matrix) separately. Besides be¬ 
ing simple, one-dimensional VS prevents nnnecessary correlation among transformed 
channels incnrred by mnlti-dimensional VS. Note that the correlations among flnores¬ 
cence channels dne to spectral overlap are compensated before we stabilize variance. 


Even thongh the protein expressions can still be correlated 18 , I do not inclnde snch 


correlations in the VS process becanse the natnre of snch problem-specihc correlation 
is difficnlt to model. 


Since the actnal error model of FC data is nnknown, it is not trivial to select 
a fnnction to transform this data. Even thongh researchers have stndied a nnmber 
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of normalization functions, they are often selected arbitrarily 23,24. Considering 
the similarity in fluorescence-based data collections between FC and microarrays, I 
decided to use an inverse hyperbolic sine (asinh) function that has been shown to 


successfully stabilize variance in fluorescence readouts from microarray data 22,66 


This choice of asinh function is also motivated by its success in FC data visualization 


and normalization 23,24. Stabilizing variance with other transformations can be 
performed using the same flowVS framework but is not discussed here. 

To transform a fluorescence channel z, I use the asinh transformation with a single 
parameter c: 

asinh( 2 ;/c) = \\i{z/c-\- \/ {z/cY + 1). (2.1) 

In this transformation, c is called the normalization cofactor whose value is optimally 
selected to stabilize variance in channel z. Let cr?- be the variance of the cluster 
dehned in channel/marker z in the sample. My objective is to select a cofactor 
for the asinh transformation such that after transforming channel 2 ; in each sample, 
variance of every cluster becomes approximately homogeneous. 

In a more general form, asinh transformation is expressed with three parame¬ 
ters, a * asinh(6 + z/c), where in addition to the cofactor c, a denotes a scaling after 
transformation, and b denotes a translation before transformation. I set a = 1 since 
other values do not affect downstream analysis, and 6 = 0 to avoid shifting of cell 
populations. Hence I am left with a single parameter, c, that I aim to set in order to 
stabilize the variance. Note that other transformations such as logarithmic, hyper¬ 


log 17 , logicle 18 , Box-Cox etc., may also stabilize the variance but a detailed 
analysis covering all these transformations is out of the scope of this study. 


2.3.3 flowVS: an algorithm for per-channel variance stabilization 

Given a collection of N FC samples, the flowVS algorithm stabilizes the variance 
in a fluorescence channel z with the following steps. 
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1. Selecting a sequence of cofactors: I select an evenly spaced increasing seqnence 
of cofactors Ci, C 2 ,Q to be nsed with the asinh transformation. The start and 
end valnes (ci and q) are empirically selected so that the seqnence inclndes a 
variance stabilizing cofactor. 


2. Transforming data and evaluating homoskedasticity for each cofactor. For each 
cofactor Cq G {ci,C 2 , the following steps are performed. 


(a) Transforming channel/marker z in each sample: Let zj be a vector de¬ 
noting the selected channel in the sample where 1 < j < iV. The 
algorithm transforms Zj by the asinh fnnction: z' = asinh(zj/cg), where Zj 
is the same channel after the transformation. 


(b) Detecting 1-D density peaks (1-D clusters): I estimate the density of each 
transformed flnorescence vector z) by a kernel density estimation method. 
The peaks in the density of z) are identihed as regions of high local den¬ 
sity and signihcant cnrvatnre (also called landmarks in [^). To identify 
the 1-D density peaks, flowVS nses the curvlFilter fnnction from the 


f lowCore package 59 in R. Here a density peak represents a 1-D clnster 


of cells and therefore can also be identihed by a clnstering algorithm 54 
Let Pj be the collection of density peaks identihed from z). 

(c) Collecting density peaks from all sample: Density peaks from all samples 
are collected into a set P snch that P = Ui<j<ArPj. Let P contain a total 
of m density peaks where the peak contains cells with mean /ij and 


variance crj. 


(d) Testing homoskedasticity: The performance of the asinh transformation 
with cofactor Cq is evalnated by a test of variance homogeneity (ho¬ 
moskedasticity). For this pnrpose, I nse Bartlett’s test [^, a well-known 
likelihood-ratio test for homoskedasticity. Assnming n = 
be the total nnmber of cells in P and cXp to be the pooled variance of m 
density peaks, I compnte Bartlett’s statistics as follows: 
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B{c,) 


(n-m) ln(a^)-E::i(^.-l) 
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' 3(m—1) ^Z—/z=l rii — l n—m J 


( 2 . 2 ) 


Bartlett’s statistics B{cq) computes the degree of homogeneity across all 
1-D clusters in channel after it is transformed by asinh( 2 ;j/cq). 


3. Finding a cofactor for optimum VS: The optimum variance stabilization is 
achieved by a transformation giving minimum value of Bartlett’s statistics. 
Therefore, the asinh transformation with the cofactor 

i 

Cq* = argmin i?(cq), (2.3) 

gives the optimum VS for the channel/marker z. Channel Zj in each sample is 
then transformed by asmh.{zj/Cq») and used in subsequent analysis. 


2.4 Results 


I demonstrate how the flow VS algorithm stabilizes variance with the HD dataset 


described in Section fl.6.1[ Recall that the HD dataset consists of 65 samples from hve 
healthy individuals who donated blood samples on different days. Each sample was 
divided into five replicates and each replicate was stained using labeled antibodies 
against CD45, CD3, CD4, CDS, and CD19 protein markers. Before variance stabi¬ 
lization each sample is preprocessed, compensated for spectral overlap, and gated on 
FS/SS channels to identify lymphocytes. Since lymphocytes always express CD45 
protein (CD45 is a common leukocyte marker), I omit this marker from the rest of 
the discussion. 


2.4.1 Selecting the optimum cofactors for the asinh transformation 

For each sample of the HD dataset, the flowVS algorithm identifies density peaks 
in CD3, CD4, CDS, and CD19 markers/channels by following step 2(b) in Sec¬ 


tion 2.3.3 In each of these four channels, the algorithm identihes two density peaks 
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Figure 2.1. Mean Fluorescence Intensity (MFI) of each 1-D cluster is plot¬ 
ted against the variance of the cluster for different choice of cofactors used 
with the asinh transformation. Clusters in each marker are shown with 
the same symbol and color, (a) no transformation: variance increases 
monotonically with the mean, (b) small cofactor: VS is not achieved, (c) 
reasonable value of cofactor (same for all channels): mean-variance depen¬ 
dence remains except for CD4 channel/marker, and (d) optimal values of 
cofactors: variance is approximately stabilized for each marker/channel. 


representing high-expressing (“positive”) and low-expressing (“negative”) cell popu¬ 
lations (e.g., CD3“, CD3’'' clusters in the CD3 marker/channel). Therefore, from the 
65 samples in the HD dataset, I obtain 130 (65 x 2) 1-D clusters for each marker, 
giving a total of 520 (65 x 2 x 4) 1-D clusters in CD3, CD4, CDS, and CD19 channels. 

For each of these 520 clusters, I compute the within-cluster variance and av¬ 


erage marker expression (Mean Fluorescence Intensity, MFI). Figure 2.1 plots the 
mean-variance relationship for every cluster (density peak) before transforming the 
channels and after they are transformed by asinh function with different cofactors. 
In this hgure, clusters identihed in channels transformed with different cofactors are 
shown in different panels and in each panel, clusters in a marker are shown with 
the same symbol and color. Subhg. 2.lKa) reveals the non-linear correlation between 




















32 



1.2e+06 

1.0e+06 

8.0e+05 

6.0e+05 

4.0e+05 

2.0e+05 



Cofactor in asinh transformation ( x 1000) 



Figure 2.2. Finding the optimum values of the cofactors in asinh trans¬ 
formation for each fluorescence channel in the HD dataset. 


cluster variance and mean, which is typically observed in raw FC data before ap¬ 
plying any transformation. The mean-variance dependence is not alleviated after 
transforming the data by asinh function with arbitrary cofactors as can be seen in 
Subhg. 2.1'b,c). Finally, Subhg. [2T| (d) shows that the variances of the 1-D clusters 
become approximately stabilized and independent of the means after the optimum 
variance stabilization is performed. On CDS marker, for example, CD3“, CDS’*" clus¬ 
ters (shown by green triangles in Subhg. |2.lK d)) have approximately stable variances 
and no visible correlation exists between the variances and the cluster means. 

The optimum cofactor for the asinh transformation is selected by minimizing 
Bartlett’s statistics separately for each of the four markers. Figure |2]^ shows Bartlett’s 
statistics computed in each channel after it has been transformed by asinh function 
with a sequence of cofactors. A minimum is observed for every channel and the 
cofactor is set to the value of the minimizer of Bartlett’s statistics. The variance 
stabilizing cofactors for different markers are: (a) 5,000 for CDS, (b) 6,000 for CD19, 
(c) 7,000 for CD3, and (d) 10,000 for CD4. Therefore, the channels of the HD 
dataset are transformed by asinh function with the optimum cofactors. As shown 
in Subhg. |2.1[ d), the optimum transformations approximately stabilize variances in 
different channels. 


I plot the density of the variance stabilized channels in Fig. |2.3 where different 
colors are used to denote the samples from hve different subjects. In Fig. both 
positive and negative density peaks (clusters with high or low marker expression) 
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Figure 2.3. Density plot of the variance stabilized channels across all 
samples in the HD dataset. Different colors used for samples from different 
subjects. 


spread approximately equally across all samples, conhrming the homogeneity of vari¬ 
ances in one dimensional clusters. For this dataset, the density curves from the same 
subjects are tightly grouped together as expected. However, clusters across subjects 
may not be well aligned due to the between-subject variations. Aligning density 


peaks across samples - as was done by Hahne et al. 52 - is not an objective of the 


VS algorithm, because such shifting of density may potentially eclipse true biological 
signals across samples. 


2.4.2 Normality of the variance-stabilized clusters 

In addition to stabilizing the variances, the transformed channels approximately 
follow the normal distribution. To see this, I draw the Quantile-Quantile plots (Q- 
Q plots) for eight 1-D clusters obtained from a representative sample in the 
HD dataset in Figure |2.4[ In each Q-Q plot, the distribution of a 1-D cluster is 
compared with the standard normal distribution by plotting their quantiles against 
each other. If a cluster is normally distributed (i.e., linearly related to the standard 
normal distribution), the points in the Q-Q plot approximately lie on a straight line. 
All eight Q-Q plots in Fig. 2.4| show linearity in their central parts, except small 
deviations at the ends, indicating that the 1-D clusters approximately follow normal 


distributions with heavier tails. As I will discuss in Section 5.3, the normality of 
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Figure 2.4. The Q-Q plots for the eight 1-D clusters obtained from a 
representative sample in the HD dataset. Every Q-Q plot shows linearity 
in the central part, except for a little deviation at the end, indicating that 
the clusters approximately follow a normal distributions with heavier tails. 





clusters is a desired condition for the analysis of variance (ANOVA) model used to 
evaluate the homogeneity of a collection of similar clusters, also known as a meta¬ 
cluster. 


2.4.3 Application to microarray data 

The VS approach based on optimizing Bartlett’s statistics can also be used to 
stabilize variance in microarray data. However, the initial steps of flow VS need to 
be adapted for microarrays. Assume that the expressions of m genes are measured 
from N samples in a microarray experiment. After transforming the data by the 
asinh function, the mean fii and variance af of the gene Qi are computed from 
the expressions of Qi in all samples, flow VS then stabilizes the variances of the genes 
by transforming data using the asinh function with an optimum choice of cofactor. 
Unlike FC, a single cofactor is selected for all genes in the microarray data. 

I have applied the modihed flow VS to the publicly available Kidney microarray 


data provided by Huber et ah 22 . The Kidney data reports the expression of 8704 
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Cofactor of asinh transformation (log scale) 



(a) Selecting optimum cofactor for flow VS 


(b) VS by different methods 


Figure 2.5. (a) For Kidney microarray data 22 , flowVS selects the opti¬ 


mum cofactor for the asinh transformation by minimizing Bartlett’s statis¬ 
tics. The cofactors are shown in the natural logarithm scale, (b) The 
standard deviation and mean of each gene from the Kidney data are plot¬ 
ted before transformation and after variance stabilization by flow VS, VSN, 
and DDHFm. Loess regression is used to smoothen the curves. 


genes from two neighboring parts of a kidney tumor by using cDNA microarray tech¬ 
nology. For different values of the cofactor, flow VS transforms the Kidney data with 
the asinh function and identifies the optimum cofactor by minimizing Bartlett’s statis¬ 


tics. Subfig. 2.5(a) shows that a minimum value of Bartlett’s statistics is obtained 
when the cofactor is set to exp(6) (~ 400). The optimum cofactor is then used with 
the asinh function to transform the Kidney data. 

I compare the VS performance of flow VS with two software packages, VSN by 


Huber et ah 22 and DDHFm by Motakis et al. 75 . Similar to flowVS, VSN uses 
asinh transformation whose parameters are optimized my maximizing a likelihood 


function 22 . DDHFm applies a data-driven Haar-Fisz transformation (HFT) [75,[76 


to stabilize the variance. Both VSN and DDHFm are developed for stabilizing variance 
in microarray data and can not be applied to flow cytometry data. 

Before transforming the Kidney data and after transforming it by flowVS, VSN, 


and DDHFm, I plot the mean and standard deviation of every gene in Subhg. 2.5(b) 
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Figure 2.6. Variance stabilization of the Kidney microarray data by 
(a) flow VS and (b) VSN j^. Each black dot plots the standard deviation 
of a gene against the rank of its mean. The red lines depict the running 
median estimator. If there is no mean-variance dependence, then the red 
lines should be approximately horizontal. 


In this hgure, I have applied a loess regression to obtain smooth average curves. We 


observe in Subhg. 2.5(b) that the standard deviation of the untransformed Kidney 
data increases monotonically with the mean. Both VSN and flow VS approaches 
stabilize the variance approximately for all genes in this data. However, the Haar- 
Fisz transformation achieves good VS properties only for genes with higher levels of 
expressions. 

To take a closer look at the transformed data by flowVS and VSN, I plot the 


variances of the genes against the ranks of the means with two subhgures in Fig. 2.6 


These hgures are generated by meanSdPlot function from the VSN package. Here, 
the ranks of the means distribute the data evenly along the x-axis and thus make it 
easy to visualize the homogeneity of variances. Both VSN and flowVS remove the 
mean-variance dependence since the red lines are approximately horizontal in both 


Subhg. |2.6K a) and Subhg. 2.6 b). Therefore, flowVS performs equally well compared 
to the state-of-the-art approach developed for the microarray data. 
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2.5 Conclusions 

I describe a variance stabilization framework, flow VS, that removes the mean- 
variance correlations observed in cell populations from flow cytometry samples. This 
framework transforms each marker/channel by the asinh function whose normaliza¬ 
tion cofactor is optimally selected by Bartlett’s likelihood-ratio test. I show, with 
a hve-dimensional healthy dataset consisting of 65 FC samples, that flowVS re¬ 
moves the non-linear mean-variance dependence from raw FC data and makes the 
within-population variances relatively homogeneous across all populations, flow VS 
also performs comparably to the state-of-art variance stabilization approaches for the 
microarray data. 

Variance homogeneity (homoskedasticity) is a desirable property for comparing 
populations across conditions, building meta-clusters from phenotypically similar pop¬ 
ulations, and analyzing meta-clusters in an ANOVA model. However, unlike the 
earlier approach by Hahne et ah [^, flowVS does not artihcially shift populations 
to align them in the marker space. By stabilizing the variances, flowVS homoge¬ 
nizes similar cell populations and establishes the foundation of biologically meaningful 
meta-clusters and templates as will be discussed in Chapter 

The VS framework presented here has several limitations. First, flowVS stabilizes 
variance separately in each marker/channel. This approach is inadequate to stabilize 
covariances across multiple channels, which is necessary when channels are correlated. 
Second, flowVS repeatedly identihes 1-D cell clusters (density peaks) and evaluates 
the homogeneity of clusters by the likelihood-ratio test. Therefore, this framework 
does not perform well when cell clusters are not easily identihable. Third, flowVS 
stabilizes variance more accurately when more samples are simultaneously passed to 
the framework. Hence, the approach is not suitable for normalizing a single sample or 
stabilizing variances of sequentially arriving samples. Note that we stabilize between- 
sample variances in microarray data; therefore, VS can not performed on a single 
microarray sample. Finally, Bartlett’s test used in flowVS assumes that the deviation 
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from normality is relatively modest. If data deviates significantly from normality, 
other likelihood ratio tests can be employed, such as Levenes test or the Brown- 
Forsythe test (^. However, I tested flowVS with several FC datasets and in every 
case Bartlett’s test outperforms the Levenes and Brown-Forsythe tests in selecting 
variance stabilizing trnasformations. 

flow VS operates as an independent module in the FC data analysis pipeline. It 
does not depend on the preprocessing algorithms applied before VS nor on the post- 
analysis methods such as matching, meta-clustering, and classihcation algorithms. 
Hence, flow VS is capable of working with other downstream algorithms developed 
by other researchers, such as FLAME by Pyne et al. and flowTrans by Finak et 
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3 POPULATION IDENTIFICATION BY CLUSTERING ALGORITHMS 


3.1 Introduction 


In this chapter, I discuss clustering algorithms for identifying phenotypically dis¬ 
tinct cell clusters in a flow cytometry (EC) sample. These algorithms partition a 
multidimensional sample into several phenotypic clusters such that cells within a 
cluster are biologically similar to each other but distinct from those outside the clus¬ 
ter. A phenotypic cell cluster usually represents a particular sub-type of cells with 
specihc biological function and often called a cell population in flow cytometry. Ana¬ 
lyzing cellular functions at the level of populations instead of individual cells conveys 
more robust biological information because of the natural variations within cells of 
the same type. Therefore, identifying cell populations from the mixture of different 
cell-types turns into an important step in any EC data analysis pipeline. 

Traditionally, cell populations are identihed by a manual process known as “gat¬ 
ing”, where cell clusters are recognized in a collection of two-dimensional scatter 
plots (see Fig. |3.3[ b) for an example). However, with the ability to monitor a large 
number of protein markers simultaneously and to process a large number of sam¬ 
ples with a robotic arm, manual gating is not feasible for high-dimensional or high- 
throughput EC data. To address the gating problem, researchers have proposed 


several automated clustering algorithms, such as CDP 25 , FLAME |27|, FLOCK 28 

flowMeans 


flowClust/Merge 26,79 


31 , MM 80 , curvHDR 81 , SamSPECTRAL 30 


SWIFT 82 , RadialSVM 83 , etc. For a summary of these algorithms, see Table 1 


of 32 . Aghaeepour et ah 32 provides a state-of-the-art summary of the held. 


Different algorithms perform better for different EC datasets as was 
observed in a set of challenges organized by the FlowCAP consortium 
(http://flowcap.flowsite.org/) [^. Given the large number of algorithmic op- 
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tions, it is often difficult to select the best algorithm for a particular dataset. When 
the “ground truth” or “gold standard” about the clustering pattern is unavailable, 
we can evaluate the quality of a clustering solution with the cluster validation meth¬ 


ods 33,34 . The validation methods evaluate how well a given partition captures the 
natural structure of the data based on information intrinsic to the data alone. They 
can be used in selecting the optimum parameters for a clustering algorithm (e.g., the 
optimum number of clusters), as well as choosing the best algorithm for a dataset. 

I describe several cluster validation methods that can evaluate the quality of the 
clustering solution given by an algorithm. In general terms, there are three major 
cluster validation approaches based on external, internal, and relative validation cri¬ 


teria 33,34 . The internal validation techniques evaluate how well a given partition 
captures the natural structure of the data based on information intrinsic to the data, 
and therefore suitable to select optimum parameters for a clustering algorithm. I 
discuss hve popular internal cluster validation methods and show that they can be 
simultaneously optimized to select the algorithm with the best performance on an 
FC sample as well as the parameters with the algorithm. 

If an agreement among the validation methods can not be reached when choosing 


an algorithm, we use a consensus of several clustering solutions 32,84. I present two 
heuristic algorithms that compute consensus clusterings from a collection of partitions 
(cluster ensembles). The hrst heuristic approach is called Clue developed by Hornik et 


al. 35084 , while the second heuristic approach is f lowMatch developed by myself 36 


Using a representative sample, I show that consensus clustering performs better than 
the individual algorithms under different cluster validation methods. The superior 


performance of consensus clustering was also observed in 32 , where the consensus 


clustering by Clue 84 outperformed the “best performing” algorithm based on their 
similarities with the manual (visual) gating (when evaluated by an external validation 
method called the F-measure). 


The rest of the chapter is organized as follows. In Section 3^, I discuss several 
simple clustering algorithms that can be used as off-the-shelf tools to cluster FC 
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data. Section [3731 discusses different cluster validation methods. The next Section |3]4] 
discusses the two heuristic algorithms for constructing a consensus clustering from 


a collection of partitions (cluster ensembles). In Section 3.5, I demonstrate different 


aspects of clustering algorithms with two FC samples from two separate datasets. I 


conclude the chapter in Section 3.6 


3.2 Clustering algorithms 


Clustering (automated gating) is arguably the most researched topic in compu¬ 


tational cytometry 25-28,31,32,58,79-83,85 . A state-of-the-art summary of this 


held is discussed by Aghaeepour et ah 32 . These algorithms often take few parame¬ 


ters whose values must be set appropriately to obtain good clustering solutions. For 


example, the spectral clustering package SamSPECTRAL 85 takes two arguments (in 


addition to other arguments), (a) normal, sigma - a scaling parameter that deter¬ 
mines the “resolution” in the spectral clustering stage and (b) separation, factor - 
a threshold controlling clusters separation. To obtain a good clustering solution from 
SamSPECTRAL, these two parameters need to be selected carefully for every dataset. 
As a second example, the simple and fast algorithm f lowMeans also depends on a 

parameter MaxN that inhuences the quality of clustering. I will show how the selection 


of parameter inhuences clustering solution, with an example in Section 3.5.1 


The algorithms discussed in FC literature are built on top of a few fundamen¬ 
tal algorithms, such as the k-means, Gaussian mixture modeling, spectral clustering. 


hierarchical clustering, etc. 33 . Here, I move one step back to see how the basic 


algorithms perform, in a completely unsupervised setting, in clustering FC samples. 
For this purpose, I selected six popular algorithms whose efficient implementations 
are available as open-source packages. The algorithms, their computational complex¬ 


ity, and the R packages that implement them are listed in Table |3.1[ K-means is a 
popular algorithm that partitions n observations into k clusters in which each obser¬ 
vation belongs to the cluster with the nearest mean 


87 . Clara (Clustering LARge 
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Table 3.1 

Several clustering algorithms are listed along with their complexity and 
the R packages implementing them. In the complexity, n denotes the 
number of cells in a sample and p idenotes the number of features (di¬ 
mension) measured pre cell. The parameter I is the number of iterations 
required before an algorithm converges. For Clara, m denotes the number 
of subsamples and s denotes the number of items in each subsample. 


Clustering algorithm Computational complexity R package 


K-means 

86, 

87 

0{nl) stats 

96 


Glara (PA 

M) 

85 

0{n + ms‘^) cluster 

97 



Hclust 


33 


0{n^) 


fastcluster 


89 


GMM 

90 



0{nl) 

Rmixmod 

91 


SOTA 

92 



O(nlogn) 

clValid 

98 


Spectral 

13 

94 

0{n^) 

kernlab 

99 



Applications) is a sampling based algorithm that creates a collection of m subsam¬ 
ples of size s from n observations. Once k representative objects have been selected 
from the sub-dataset, each observation of the entire dataset is assigned to the nearest 


medoid 88 . Hclust is a hierarchical clustering algorithm that builds a hierarchy of 
data points into k clusters. Here, I used the Unweighted Pair Group Method with 


Arithmetic Mean (UPGMA) method 33,89 . GMM is the Gaussian mixture mod¬ 
eling algorithm that models a sample with a mixture of k normal distributions and 
assigns observations into k Gaussian components by the Expectation-Maximization 


(EM) algorithm 90,91 . The Self Organizing Tree Algorithm (SOTA) uses an unsu¬ 
pervised neural network with a binary tree topology and recursively divides clusters 


with the largest diversity until k clusters remain 92 . The spectral clustering works 
by embedding the data points into the subspace of the k largest eigenvectors of a 


normalized affinity/kernel matrix 93,94 . An informative discussion about different 


clustering algorithms can be found in 95 
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I apply each of the six algorithms to cluster an FC sample. For slower algorithms, 
such as the spectral clustering, I downsample the data into a smaller size so that the 
algorithms hnish in a reasonable amount of time. I always use the default arguments 
of these algorithms except the number of clusters k. However, I often perform a 
second phase to estimate the statistical parameters of the clusters identihed by the 
clustering algorithms. For this purpose, I characterize an FC sample with a mixture of 
k normally distributed clusters of p-dimensional points. Each cluster is characterized 
by a multi-dimensional normal distribution and is represented by two parameters n, 
the p-dimensional mean vector, and S, the p x p covariance matrix p6]. To estimate 


the parameters of k clusters, I use the Expectation-Maximization (EM) algorithm 90 


3.3 Cluster validation methods 


Clustering is an unsupervised process of identifying phenotypically similar cell 
populations from an FC sample. When clustering patterns and the number of clusters 
are not known a priori, the clustering solution is evaluated for quality assurance 


33,34,100 . Cluster validation usually answers questions like “how well does the 


resulting partition £t the data?”, “how many clusters are there in a sample?”, and 
“from a collection of algorithms, which algorithm provides the best partition?”. 

In general terms, there are three major cluster validation approaches based on 
external, internal, and relative validation criteria |Mp4] . External validation methods 
evaluate a clustering result based on an a priori knowledge of correct class labels 
or the “gold standard”. Internal validation techniques evaluate how well a given 
partition captures the natural structure of the data based on information intrinsic to 
the data alone. Relative validation methods compare two partitions to identify their 
relative differences. While external and relative validation techniques depend on some 
benchmarking or “gold standard”, internal validation methods do not depend on any 
prior knowledge. Given the limited number of “gold standards” in flow cytometry, 
internal validation is the most practical approach for most FC dataset, and it is 
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the focus of my discussion. However, occasionally prior human interpretation (e.g., 


manual gating) is available for a subset of samples of an FC dataset 32 . In this 
scenario, I discuss the application of external cluster validation methods briefly in 
the latter part of this section. The following discussion is limited to validating hard 
clustering where each data item is assigned to exactly one cluster. Evaluation methods 


for fuzzy clustering are discussed in 100 -102 


3.3.1 Internal cluster validation methods 


Consider an n x p data matrix A storing n p-dimensional observations. In flow 
cytometry, A represents a biological sample measuring p features from n cells. Let 
(i(x, y) be the Euclidean distance between two data items, x and y. I use (i^(x, y) to 
denote the squared Euclidean distance between x and y. Assume that A is partitioned 
into k clusters C = {ci, C 2 , ..., Ck}, where the cluster q contains \ci\ data items 
with mean 

Most internal cluster validation methods use the concepts of within-cluster com¬ 
pactness and between-cluster separation. The former concept minimizes the intra¬ 
cluster distances, while the latter maximizes the between-cluster distances. In the 
discussion on internal validation methods, I use A(cj) to denote the intra-cluster dis¬ 
tance of a cluster Cj and D{ci, Cj) to denote the between-cluster distance from q to Cj. 
Different cluster validation methods dehne the intra- and inter-cluster distances dif¬ 
ferently. Based on these dehnitions, I discuss hve popular internal cluster validation 
methods that I use to evaluate automated gating of FC samples. 


Calinski-Harabasz (CH) Index: Calinski and Harabasz 103 dehned the intra¬ 
cluster distance A(ci) of a cluster q as the sum of squared distances from each data 
item X G Cj to the cluster center pi^. They dehned the separation D(cj) of cluster Cj 
as the squared distance between the center pi^ and the overall center pi of the entire 







45 


sample. Then the Calinski-Harabasz (CH) index 103 is defined as the ratio of the 
average cluster separation to the average within-cluster distance. 


^(c*) = Mi), D{ci) = |ci| fi), 

xGCi 


CH = 


(n-k) D{ci) 

Ci^C 

i.k-1) E ^(Ci) 


(3.1) 


A large value of the CH index denotes larger separation among clusters relative to 
the average within-cluster distance. The CH index is limited to the interval [0, -|-cxd] 
and should be maximized. The time needed to compute the CH index is 0{n + k), 
which is linear in the number of data points, and thus faster for large data matrices. 

Dunn’s Index: Dunn defined the intra-cluster distance A(ci) of a cluster q as the 
maximum distance between pairs of data items x, y G Cj and the between-cluster 
distance D{ci,Cj) for a pair of clusters c, and Cj as the minimum distance over all 


pairs of data items x G q and y G m. Dunn’s index 104 then measures the ratio of 


the smallest between-cluster distance to the largest intra-cluster distance. 


A(q) = max d(x,y), D{ci,Cj)= min d(x,y), 

x.ySc; xSCi.ySCj 


Dunn = 


-Erw ( min D{ci,c. 

max(A(c;) \ r n ar r ' 


(3.2) 


Dunn’s index is limited to the interval [0, -l-oo] and should be maximized. The time 
complexity for computing Dunn’s index is 0{n‘^ + k‘^), which reduces to O(n^) for 
bounded k. Therefore, Dunn’s index is computationally expensive when the data 
matrix is large. 

Average Silhouette Width (ASW): Let a(xj) be the average distance from the 
data item Xj to all other data items within the same cluster, and let 6(xj) be the 
lowest average distance from Xj to a cluster to which Xj does not belong. Then the 
silhouette width of Xj is computed as follows: 

6(xj) - a(xj) 


^(Xi) = 


(3.3) 


max{6(xi),a(xj)}‘ 

The Average Silhouette Width (AWS) of the partition C is computed by averaging 
S'(xj) over all data items 105|. The AWS is limited to the interval [—1,1] and should 
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be maximized. The time to compute the AWS is O(n^), and similar to Dunn’s index, 
it is computationally expensive when the data matrix is large. 


Davies-Bouldin Index: Davies and Bouldin 106 dehned the intra-cluster distance 
A(ci) of a cluster Cj as the square root of the average squared distance from each 
data item x G c* to the cluster center They dehned the between-cluster distance 
D(cj, Cj) for a pair of clusters q and Cj as the distance between their clusters centers 
and ^j. Then the Davies-Bouldin index is dehned as follows: 

1/2 


^(d) = < rn 


{ N L/Z 


A(ci) A(cj) 


(3.4) 


The Davies-Bouldin index is limited to the interval [0, -|-cx)] and should be minimized. 
The time to compute the Davies-Bouldin index is 0{n + k"^), which is linear in the 
number of data points. Therefore it is faster to compute for large data matrices. 

S-Dbw Index: Let cr^(cj) be a p-dimensional vector whose component o'g(ci) 
contains the variance of data items x G q in the dimension. Also let cr^ be a 
p-dimensional vector whose g*^ component contains the variance of data items 
from all clusters in the g*^ dimension. Then the average intra-cluster distance A(C') 
of the partition C is computed as follows: 


A(C) = 



(3.5) 


where ||.|| is the norm. Assume that the density p(cj) of the cluster Cj is dehned as 
the number of data points that lie within a hyper-sphere centered at with A(C') 
as the radius. 


d(D) = where /(x,^J 


xSCi 


0, if d(x, > A(C') 
1, otherwise 


(3.6) 


The average inter-cluster density can be computed as 

1 p{Ci U Cj) 


p{C) = 


E 


Hie mt“{p(Ci).P(9)} ’ 


(3.7) 
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Table 3.2 

Summary of the cluster validation indices. In the complexity, n is the 
number of data points in a sample and k is the number of clusters. The 
dimension of a sample p is omitted from the calculation of complexity. 


Cluster validation methods 


Calinski-Harabasz 
Dunn 104 


103 


Average Silhouette Width 
Davies-Bouldin 106 


S_Dbw 1107 


105 


Range Optimization Time complexity 


[0, CX)] 

maximize 

0{n + k) 

[0, CX)] 

maximize 

0(n^ + k^) 

[-1, 1] 

maximize 

O(n^) 

[0, cx] 

minimize 

0(n + P) 

[0, cx] 

minimize 

O(nk^) 


where p(cj U Cj) is the density computed after merging q and Cj and then using 


Eq. 3.6 The S_Dbw index 107 is then dehned as the summation of the normalized 


intra-cluster distance and average inter-cluster density: 

{ACf 


S.Dbw = 


+ p{C). 


(3.8) 


The S_Dbw index is limited to the interval [0,-|-cxd], and should be minimized. The 
time to compute the S_Dbw index is 0{nk‘^), which is linear in the number of data 
points, and can be computed fast for large data matrices. 


Table 3^ gives a summary of these hve internal cluster validation methods. Be¬ 
sides these hve internal cluster validation methods, several other methods have also 


been proposed in the literature, such as Ball-Hall 108 , C 109 , Ray-Turi 110 


Scott-Symons [111 , and other indices. The papers 34,95,112 provide more details. 


Different cluster validation methods are also implemented in several R packages such 
clValid 


as civ 113 


114 , clusterCrit 115 


3.3.2 External cluster validation methods 

Unlike internal cluster validation methods, external methods evaluate a clustering 
result based on a prior knowledge of correct class labels or the “gold standard”. In 
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FC, we often seek completely unsupervised clustering because of the unavailability 
of “gold standards”. However, occasionally prior human interpretation (e.g., manual 


gating ) is available for a subset of a larger FC dataset 32 . Then the external cluster 
validation methods can be applied to evaluate the consensus between the clustering 
solution provided by an algorithm and the known “ground truth”. 


In flow cytometry, F-measure 116 is a commonly used external method to eval¬ 
uate the closeness of a partition to the “gold standard” or manual gating. Assume 
that C = {ci, C2,..., Cfc} is a set of clusters computed by a clustering algorithm while 
T = {^ 1 ,^ 2 , is a set of true clusters of the same sample. Then, precision is the 

number of data points in the computed cluster Cj that belong to the correct cluster 
ti and recall is the number of data points in the true cluster that are recovered by 
the computed cluster Cj. The F-measure F{ti,Cj) between ti and Cj is the harmonic 
mean of precision and recall. Let \ti\ and \cj\ be the number of data points in ti 
and Cj respectively and \ti fl Cj\ be the number of points belonging to both U and 
Cj simultaneously. Then the precision, recall, and F-measure between F and Cj are 
dehned as follows. 

Precision(fj, Cj) = Recall(fj, Cj) = 

rp/i \ _ 2 • Precision(ii,Cj) • Recall(tj,Cj) 

V ^j) Precision(tj,Cj)+Recall(ij,Cj) 

The overall F-measure for the partitioning C with respect to T is computed as 


(3.9) 


F(T,C) = V!^max{F((„c,)}, 

n c,gc 

ti^T 


(3.10) 


where n is the number of data points in the sample. 

Several other external cluster validation methods were also proposed in the lit¬ 


erature, such as Rand index 117 , Jaccard coefficient 118 , Minkowski score 119 


Fowlkes-Mallows index 120 , etc. External cluster validations based on combinatorial 
optimization have also been used, such the R-metric, transfer distance or partition dis¬ 
tance that computes the minimum number of augmentations and removals of points 


needed to transform one partition into another 121-124 . Note that the external clus¬ 
ter validation approaches always compare two partitions of the same sample. These 
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methods are used to compare the quality of a clustering algorithm relative to the 
“gold standard” and to compare the performance of multiple algorithms applied to 
the same sample. However, they cannot be used to compute the dissimilarity between 
partitions from two different samples. In Chapter I will discuss a combinatorial 
algorithm for computing dissimilarities between a pair of partitions from different 
samples. 

3.3.3 Selecting the number of clusters (cell populations) 

Most clustering algorithms take the number of clusters k as an input and assign 
the data points into k clusters. However, the optimum number of clusters k* is often 
not known in unsupervised clustering problems. Especially in FC, the number of phe- 
notypically distinct cell populations may not be known in advance. In this situation, 
a clustering algorithm is run for different number of clusters in a range [kmin, kmax]- 
The clustering quality for each k is then computed by internal cluster validation meth¬ 
ods and an optimum number of clusters k* is selected by maximizing/minimizing the 
validation indices. For example, the values of avg. silhouette width (ASW), Calinski- 
Harabasz (C-H), and Dunn indices are maximized and Davies-Bouldin and S_Dbw 
indices are minimized to obtain the optimum number of clusters. However, if differ¬ 
ent cluster validation methods disagree with each other about k*, a consensus (e.g., 
majority voting) of them is used to select k*. 

3.3.4 Selecting the “best” algorithm 

Given a collection of algorithms, which algorithm should we use to cluster a sample 
for the selected optimum number of clusters k*7 To select an algorithm for clustering 
a particular sample, I consider the following three conditions. First, if most clustering 
algorithms perform equally well under all validation indices, then I select the fastest 
of the better performing algorithms. Second, if an algorithm consistently performs 
better than others under several validation methods, then I select the best-performing 
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algorithm. Finally, when no agreement can be reached about the best performing 
algorithm because different validation methods favor different algorithms, I compute 
a consensus of the clustering solutions provided by the algorithms. 


3.4 Consensus clustering 


To describe how a consensus clustering is computed, we hrst need a dissimilarity 
measure between a pair of partitions. Assume that a partition is dehned as a binary 
membership matrix P such that P[i,j] = 1 if the object belongs to the cluster 
and otherwise, P[i,j] = 0. Then the dissimilarity between two partitions P and Q is 
dehned as 


d(P,Q) = mm||P-Qn||, 


(3.11) 
is the Frobenius 


where the minimum is taken over all permutation matrices fl, and | 
norm (so that ||T|p = tr(y'y')). This dissimilarity measure computes the minimum 
number of augmentations and removals of items needed to transform one partition 


into another and is called the transfer distance or partition distance 121 -123 . The 
partition distance can be computed by formulating an edge-weighted matching prob¬ 
lem in a bipartite graph and solving it by efficient combinatorial algorithms (e.g., the 
Hungarian method) in 0{k^) time, where k is the number of clusters in the parti¬ 
tions [m|[i^[i^ . 

Given a collection of partitions E (possibly generated by several algorithms), a 
consensus clustering algorithm hnds a new partition P such that the total dissimi¬ 


larity between P and each partition in E is minimized 35,126,127 . The initial set 


of partitions E is often called a cluster ensemble 35,84 . An optimum consensus 
partition P* minimizes the total dissimilarity between P and each partition in the 
ensemble E\ 


P* = argmin min | |P — Qn| |. 


p n 

QgE 


(3.12) 


This optimization is an instance of multidimensional assignment problem, which is 
known to be NP-hard (by reducing it to the 3-dimensional matching problem 128|). 
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Hence, various heuristics have been developed to solve the consensus clustering prob¬ 
lem. Here I will discuss two heuristic approaches to approximately compute a con¬ 
sensus clustering. 


The hrst heuristic approach is called Clue developed by Hornik et al. 35,84 


Clue initializes the consensus partition P with a randomly selected partition from 
the ensemble. In the iteration, the algorithm randomly selects a partition Q not 
already included in the consensus clustering. The algorithm then optimally matches 
the clusters between Q and Pi-i (the consensus partition created thus far) and creates 
an updated consensus partition Pj by taking a weighted average of Pj_i and QUi, 
where Hj is the permutation matrix found by the optimum matching. 

The second heuristic approach is flowMatch developed by myself [^. f lowMatch 
computes the dissimilarity between partitions by a mixed edge cover (MEG) algorithm 
that allows a cluster from a partition to be matched to zero or more clusters in 


another partition 37 . Let N be the number of partitions in the ensemble E. In 


each iteration, the flowMatch algorithm Ends the most similar pair of partitions, 
merges them into an intermediate consensus partition and continues. The merging 
is performed by matching clusters across two partitions and collapsing the matched 
clusters into meta-clusters. Let P be the consensus-partition with K meta-clusters 
created by the above algorithm. The collection of these meta-clusters then dehnes 
the consensus partition. I will discuss this algorithm in more detail in Chapter and 
Chapter 

3.5 Results 

3.5.1 Identifying cell populations in a healthy sample 

Data description: I cluster a representative sample from the HD dataset described 
in Section [1.6.1 Each sample in this dataset measures the abundance of CD45, CD3, 
CD4, CDS, and CD19 protein markers in peripheral blood collected from a healthy 
individual. I selected a representative sample from this dataset to demonstrate differ- 
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Figure 3.1. Evaluating the performance of six clustering algorithms 
(shown in different colors) by five internal cluster validation methods 
(shown in different panels). A representative 5-D sample from the HD 
dataset is clustered by the algorithms for a sequence of k. Each clus¬ 
tering solution is then evaluated by five cluster validation methods. The 
optimum number of clusters, k* is selected by maxima in the top three 
methods and by minima in the bottom two methods. 


ent aspects of clustering algorithms. The selected sample is compensated to remove 
the effect of spectral overlap, transformed to stabilize variance, and gated on FS/SS 
channels to identify lymphocytes. The preprocessed five-dimensional data is then 
clustered to identify different subsets of lymphocytes in the healthy immune system. 

Selecting the optimum number of clusters (k*): I have applied six clustering (Ta¬ 


ble 3.1) algorithms to the 5-D healthy sample for different number of clusters in the 
interval 2-10. The range for k can be increased if the optimum solution is not found 
in the selected range. Each clustering solution is evaluated by five internal cluster 


validation methods described in Table 3.2 Fig. 3.1 shows the values of different clus¬ 


ter validation indices for different algorithms. The optimum number of clusters k* is 
selected by maxima in the top three methods (Calinski-Harabasz, Dunn, and Average 
silhouette width) and by minima in the bottom two methods (Davies-Bouldin and 
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S_Dbw). Form Fig. 3.1 we observe that all validation methods unanimously indicate 
A; = 4 as the optimum number of clusters across all clustering algorithms except for 
the spectral clustering. The spectral clustering does not perform well because a user 
needs to pass a set of carefully tuned arguments to the algorithm for better perfor¬ 
mance 85 . For this sample, I select fc* = 4 as the optimum number of clusters as 


per the consensus of the hve cluster validation methods. 

Researchers have proposed several ad hoc criteria to select the optimum number 
of clusters k* for FC data. For example, the flowMeans algorithm developed 
by Aghaeepour et ah starts with a relatively large number of clusters (Max k) and 
merges two closest clusters in each iteration. The algorithm then plots the distances 
between the merged clusters at each iteration and selects the value of k* where a 
sharp change in the segmented regression lines is observed. I cluster the same healthy 


sample used in the previous paragraph by the flowMeans software package 129 for 
three choices of Maximum k (parameter MaxN in flowMeans package) and plot the 
results in Fig. m Observe that flowMeans algorithm selected three different k* 
(shown by red circles) for three different choices of initial number of clusters. Only 
Fig. 3.2[ b) selects the correct k* as unanimously suggested by hve well-established 
cluster validation methods in Fig. |3.1 Therefore, flowMeans algorithm depends on a 
hidden parameter (Max k) when selecting k*. flowMeans estimates the initial number 
of clusters to 5 when the user does not supply it; however. Fig. |3.2K a) shows that the 
auto tuning fails to identify the correct number of clusters for this sample. Hence, 
I prefer using well-known cluster validation methods to select optimal number of 
clusters k*. 

Selecting the “best” algorithm: Which clustering algorithm should we use to clus¬ 
ter this sample? Fig. 3.1| shows that except for spectral clustering, all clustering 
algorithms perform equally well since the validation indices have optima at fc = 4. 
In this case, computing a consensus clustering does not improve the quality of the 
overall clustering solution because the individual clusterings from hve algorithms (for 
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(a) Max k=5 



(b) Max k=10 



10 8 6 4 2 


(c) Max k=20 



Number of clusters (k) 


Number of clusters (k) 


Number of clusters (k) 


Figure 3.2. Selecting the optimum number of cell populations in a sample 
from the HD dataset by the flowMeans package 129| . The maximum 
number of clusters (MaxN parameter) is set to: (a) 5 clusters (automatically 
selected by algorithm), (b) 10 clusters, and (c) 20 clusters. The optimum 
number of clusters is selected by detecting change point in the segmented 
regression lines and is shown with a red hlled circle in each subhgure. 


k = 4) have the same quality under different validation methods. Therefore, I select 
the fastest k-means algorithm to cluster this sample. 

Identifying lymphocyte sub-populations: The four clusters chosen by the k-means 
algorithm represent four biologically distinct lymphocyte sub-populations dehned 
in the hve dimensional marker space. For visualization purposes, I show the cell 
populations by a collection of 2-D projections in Figure |3.3| (b), where cell pop¬ 
ulations are shown in four different colors denoting (a) red: natural killer cells 
(CD45+CD3-CD19-), (b) blue: B cells (CD45+CD3-CD19+), (c) black: helper 
T cells (CD45+CD3+CD4+), and (d) green: cytotoxic T cells (CD45+CD3+CD8+). 
Here, every cell cluster is CDdS"*" because CD45 is a common leukocyte antigen present 
in all lymphocytes and I pre-selected lymphocytes on the forward and side scatter 
channels. 


3.5.2 Identifying cell sub-populations in T cells 


Data description: As a second example, I cluster a pre-stimulation sample from the 


T cell phosphorylation (TCP) dataset described in Section 1.6.2 The selected sample 


measures the abundance of four protein markers expressed on T cells: CD4, CD45RA, 
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Figure 3.3. (a) Simultaneous optimization of five cluster validation cri¬ 
teria (scaled to [0,1]) suggests that four cell populations are present in 
the healthy sample. Here three of the indices are maximized and two 
are minimized, (b) Bivariate projections of lymphocyte sub-populations: 
red (natural killer cells), blue (B cells), black (helper T cells), and green 
(cytotoxic T cells). 


SLP-76 and ZAP-70. The first two markers (CD4, CD45RA) are expressed on the 
surface of different T cell subsets and the last two (SLP-76 and ZAP-70) are highly 
expressed after T cells are phosphorylated [^. Since I selected a pre-stimulation 
sample (for simplicity), I consider only CD4, CD45RA to identify different subsets of 
T cells by using clustering algorithms. 

Selecting the optimum number of clusters (k*): I have applied six clustering al¬ 
gorithms to the selected 2-D sample. The algorithms were run for different number 
of clusters and each clustering solution is evaluated by five cluster validation indices. 


The values of the cluster validation indices are shown in Fig. 3.4 The optimum num¬ 


ber of clusters k* is selected by maxima in the top three methods (Calinski-Harabasz, 
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Figure 3.4. A representative 2-D sample from the TCP dataset is clustered 
using six clustering algorithms for a sequence of k (number of clusters). 
Each clustering solution is evaluated by hve cluster validation methods 
(shown in different panels). The optimum number of clusters k* is selected 
by the maxima in the top three methods and by the minima in the bottom 
two methods. 


Dunn, and Average silhouette width) and by minima in the bottom two methods 


(Davies-Bouldin and S_Dbw). Form Fig. 3.4 we observe that all validation methods 
unanimously indicate fc = 4 as the optimum number of clusters across all cluster¬ 
ing algorithms except for the spectral clustering. Therefore, for this sample, I select 
fc* = 4 as the optimum number of clusters. 


Selecting the “best” algorithm: Unlike the HD sample (Fig. 3.1), the values of the 
cluster validation indices are different across the algorithms for fc* = 4 as can be seen 


in Fig. 3.4 Furthermore, different algorithms perform best under different validation 


indices. For example, the hierarchical clustering is the best algorithm under Dunn’s 
index, while K-means and Clara are the best performers under Calinski-Harabasz 
index. Therefore, creating a consensus clustering is useful for this sample. 
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Algorithms after setting k = 4 


Figure 3.5. Five cluster validation methods are used to compare the 
quality of the “best-performing” algorithms with two consensus clustering 
methods (Clue and flowMatch) for a 2-D sample from the TCP dataset. 
A “better” clustering solution is denoted by higher values of the hrst three 
indices and lower values of the last two indices. 


I compute consensus solutions by two heuristic algorithms, Clue and flowMatch, 


discussed in Section |3.4[ The consensus clusterings are computed from the solutions 
of six clustering algorithms after the number of clusters is set to four (fc = 4). Once 
again I evaluate the performance of the consensus clustering solutions by hve cluster 
validation methods whose values are scaled to [0,1] for comparison purpose. I show 
the values of the validation indices in hve different panels in Fig. |3.5[ In addition to the 
consensus partitions, I show the validation indices of the best-performing algorithm 
under each validation method in its respective panel. According to Table 3.2, a 


“better” clustering solution is denoted by higher values of the hrst three indices 
and lower values of the last two indices. We observe in Fig. |3.5| that the consensus 
solutions perform better under four validation methods (except the Dunn’s index) 
than the best performing algorithms. The consensus clustering based on flowMatch 
performs slightly better than the solution based on Clue under all validation methods. 

The relatively poor performance of the consensus methods under Dunn’s index 
can be explained if we look closely at the dehnition of the within-cluster and between- 


cluster distances used in Eq. 3.2, Notice that the within-cluster distance is dehned by 


the diameter of a cluster and the between-cluster distance is dehned as the minimum 
distance between a pair of clusters. These dehnitions are not robust especially in the 
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CD4 


Figure 3.6. The consensus clustering computed by the flowMatch al¬ 
gorithm created from the solutions of six clustering algorithms with 
k = A. Four T cell sub-populations are shown in different colors; black: 
CD4+ CD45RAi°'^ T cells, green: CD4+ CD45RA*^ig^' T cells, red: CDd" 
CDdbRA*^^®^ T cells, and blue: CD4“ CD45RA*°" T cells. 


presence of outlying data items that are far apart from their cluster centers. Therefore, 
Dunn’s index is less robust when evaluating the quality of robust consensus clustering 
solutions. 

Identifying T cell sub-populations: I select the consensus clustering by flowMatch 
algorithm as the hnal clustering solution for the selected TCP sample. The consensus 


clustering solution for this sample is shown in Fig. 3T, where four clusters representing 
different T cell sub-populations are shown in different colors: (a) CD4’'' CD45RA^°" T 
cells in black, (b) CD4+ CD45RA'^^®'^ T cells in green, (c) CD4“ CD45RA^’®'^ T cells 
in red, and (d) CD4“ CD45RA^°’" T cells in blue. Here ‘-f’ and ‘high’ indicate higher 
abundances of a marker, and ‘ —’ and ‘low’ indicate lower levels of it. Two clusters 
with high CD4 expression, CD4+ CD45RA*°" and CD4+ CD45RA'^’®*^ cells, are called 
the memory and naive T cells respectively. Few domains of the CD4’'' cell subsets are 
phosphorylated upon stimulated by anti-CD3 antibody. To study the effect of this 
stimulation was the main purpose of the TCP dataset generated originally by Maier 
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et al. 44 . I will discuss algorithms to automatically detect phosphorylation changes 


across 


pre- and post-stimulated samples in Chapter and Chapter 


3.6 Conclusions 


We can achieve better conhdence in a computed clustering when several algo¬ 
rithms produce similar clusterings. I demonstrated that several simple, off-the-shelf 
clustering algorithms can be used to compute more accurate and robust partition of 
an FC sample than a specialized algorithm. Internal cluster validation methods can 
be used to select parameters of an algorithm such as the optimal number of clusters. 
More than one validation method yet again provide more conhdence and robustness 
in model selection. 

I have discussed a set of hve cluster validation methods - Calinski-Harabasz, Dunn, 
Average silhouette width, Davies-Bouldin and S_Dbw - that can be simultaneously 
optimized to select algorithm parameters, as well as the best-performing algorithm for 
an FC sample. However, it is possible that different validation methods disagree about 
the best performing algorithm because different validation methods favor different 
algorithms. Then a consensus of different clustering algorithm usually provides a 
“better” solution. I use the f lowMatch algorithm (originally developed for creating 
a template from similar samples) to create consensus clustering from a collection 
of partitions (cluster ensembles). By comparing the consensus clustering with the 
“best performing” algorithm, I showed that the consensus solution performs better 
when evaluated by different cluster validation methods. The superior performance 
of consensus clustering was also observed in the clustering challenges organized by 
the FlowCAP consortium (http://flowcap.flowsite.org/). In these challenges, 
more than 15 clustering algorithms developed by different independent researchers 
were used to cluster same set of samples. However, often the consensus clustering by 


Clue 84 outperformed the “best performing” algorithm based on their similarities 


with the manual (visual) gating 32 
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Clustering (automated gating) is arguably the most researched topic in computa¬ 
tional cytometry. More than twenty algorithms customized for FC data are available 
as software packages to be used by cytometrists. However, the large number of options 
often confuses cytometrists due to the lack of validation methods needed to select a 
clustering algorithm for a particular dataset. I argue that several cluster validation 
methods should be used to evaluate the clustering quality and that a consensus clus¬ 
tering be constructed whenever necessary. This approach produces a robust clustering 
solution and increases conhdence on the quality of clustering in FC data. 
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4 REGISTERING GELL POPULATIONS AGROSS EG SAMPLES 


4.1 Introduction 


Gell populations with different phenotypes and functions often respond differ¬ 
ently upon perturbation by stimuli or change of biological conditions. To track these 
population-specihc changes, we need to register corresponding cell clusters across sam¬ 
ples. The population registration is a process where phenotypically similar cell clusters 


are matched to establish the correspondence of clusters across samples 27,37 . Fur¬ 
thermore, the matched clusters are often labeled with abstract or biologically relevant 


descriptions, thus solving the population/cluster labeling problem 39 


I solve the population registration problem with a combinatorial mixed edge cover 


(MEG) algorithm 37 . Given a pair of clustered sample, I create a bipartite graph 
whose vertices are denoted with cell clusters from the samples. An edge of the graph 
is created by joining a pair of clusters with the dissimilarity between the clusters as 
the edge-weight. The MEG algorithm then matches vertices (clusters) in the bipartite 
graph by optimizing an appropriate objective function. Unlike conventional matching 
algorithms, the MEG algorithm allows a cluster from one sample to be matched with 
zero or more clusters in another sample and thus solves the problem of missing or 
splitting biological populations. In addition to solving the population registration 
problem, an optimal MEG solution provides a combinatorial dissimilarity measure 
between a pair of partitions from different EG samples. This between-sample dissim¬ 
ilarity works as a building block of the meta-clustering and classihcation approaches 
discussed in Ghapter|^and Ghapter]^ 

The ability to register multi-dimensional populations across EG samples is useful 


in various applications of flow cytometry 27,36,37,130,131 . In a cluster labeling 


application, cell clusters in a collection of samples S are labeled based on the given 
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labeling of another FC sample A. This problem can be solved by matching clusters 
between A and each sample -B G §, and labeling each cluster in B same as to the 
matched cluster from A. I give an example of labeling different sub-populations of 
lymphocytes in samples from a healthy dataset. In addition to labeling clusters, 
the MEG algorithm allows an objective way of evaluating the phenotypic differences 
between the matched clusters. By mapping clusters across two samples before and 
after stimulating T cells, I demonstrate how we can assess the population-specihc 
effects of the stimulation experiment. Furthermore, I register different cell types in a 
sample from an Acute Myeloid Leukemia (AML) patient with a healthy sample and 
demonstrate a conventional procedure of AML diagnosis. 

The MEG algorithm is similar - only in principle - to the FLAME approach pro¬ 


posed by Pyne et ah 27 , while differing from it in signihcant ways. First, FLAME 
does not match clusters directly across a pair of samples. Instead, it constructs a tem¬ 
plate of a class of samples and matches generic meta-clusters from the template to the 
clusters from each sample; therefore, performs a global registration of clusters. In con¬ 
trast, MEG algorithm performs a direct, sample-to-sample cluster matching, thereby 
emphasizes the local patterns. In fact, the approach taken by FLAME is more sim¬ 
ilar to the meta-clustering algorithm discussed in Section Second, FLAME solves 


a relaxed min-cost flow problem 132 by an integer-programming solver, whereas 
MEG uses a combinatorial algorithm to solve a relaxed minimum-weight matching 
in a bipartite graph. Several other subtle differences also exist between these two 
approaches, such as the methods used to compute distance between clusters and the 


use of cluster sizes in matching. The cluster labeling approaches 23,39 discussed 
in FG literature are very different from the MEG algorithm. These labeling algo¬ 
rithms run a second stage of clustering from the cluster-centers and hence perform a 
global labeling similar to the hrst phase of FLAME. I will discuss more about this 
approaches in the next Ghapter 

The rest of this chapter is organized as follows. At hrst, I discuss different meth¬ 
ods to compute dissimilarity between a pair of clusters, which is used by population 
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registration algorithms. In Section 4.3, I describe the mixed edge cover algorithm 


with a proof of its optimality. Section 4.4 describes different properties of the MEC 
algorithm and their implications on the population registration problem. In Sec¬ 


tion |4.5[ I demonstrate, with three representative datasets, that the MEC algorithm 
matches phenotypically similar populations across samples. I conclude this chapter 
in Section 14.61 


4.2 Dissimilarity between a pair of cell clusters 

Computing distances between clusters is an integral part of any cluster-matching 
algorithm. In the past, researchers have used various parametric and non-parametric 


methods in this purpose, such as the Euclidean distance 27 , Mahalanobis distance 


[^, the Kullback-Leibler (KL) divergence 36,37 , and the Earth Mover’s distance 


(EMD) [133 . To compute dissimilarity between two clusters ci and C 2 , the parametric 
methods use distribution parameters of the clusters while non-parametric methods use 
statistical signihcance testing. I discuss very briefly several widely used parametric 
and non-parametric dissimilarity measures. 

Parametric dissimilarity measures: To compute parametric distances, it is as¬ 
sumed that a cell cluster approximately follows a known probability distribution. 
Dissimilarity between a pair of cell clusters is then computed by the dissimilarity 
between the corresponding multivariate probability distributions. To model cell pop¬ 
ulations, researchers have used several well-characterized distribution families, such 


as normal, skew-normal, t, skew-t distributions 23,25-27,36 . For simplicity, I as¬ 


sume normally distributed cell populations in the rest of my discussion, but other 
distributions are equally applicable to the dissimilarity methods discussed below. 

Let Ci(^ii,Si) and € 2 ( 1 - 12 , ^ 2 ) be two cell populations modeled by multivariate 
normal distributions with mean vectors ^ 1 , I-I 2 and covariance matrices Si, S 2 re¬ 
spectively. The Euclidean distance computes the distance between the centers (^1 
and 112 ) of the distributions without considering the spreads of the distributions. The 
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symmetric version of the Kullback-Leibler (KL) divergence 134, 135] uses both centers 
and covariances of the distributions to compute dissimilarity (ixL(ci,C 2 ) between ci 
and C 2 : 

d_fs'L(Cl, C 2 ) = 2 (^1 ~ {^1 ^ + ^2 ^* 2 ) + ^^2 + ^2 ~ 21). (4.1) 

The symmetrized KL divergence, however, is not a metric because it does not satisfy 
the triangle inequality. Mahalanobis distance 136| uses the pooled estimate of com¬ 
mon covariance Sp from the two distributions to compute distance between clusters: 


Sp — 


(ni - l)Ei (n2 - 1)^2 


dMoici, C2) — 2 (^1 


^2)'Sp-n^i-A^2). (4.2) 


■^1 + ^2 - 2 

Note that, the above dehnition of Mahalanobis distance satishes metric properties, 
whereas the Mahalanobis distance dehned by Aghaeepour et. al. (Equation 3.7) 
is not a metric since it fails to satisfy triangle inequality |133 . 


Non-parametric dissimilarity measures: The non-parametric measures use sta¬ 
tistical signihcance testing to compute dissimilarity between cell populations. For 
example, the Kolmogorov-Smirnov (KS) statistic [137 138 computes the maximum 
absolute vertical difference between two cumulative distribution functions (CDFs). 
Another non-parametric measure uses chi-squared statistics after applying adaptive 
probability binning |133 that divides a p-dimensional cell population into bins such 
that each bin contains the same number of cells. The normalized chi-squared statis¬ 
tics is then used to compute the dissimilarity between the populations. The earth 
mover’s distance (EMD) |133 139 considers the histograms of two populations as 
two piles of dart in multivariate space and computes the minimum cost of moving 
one pile into another. Here the cost of moving dart is dehned as the amount of dirt 
moved times the distance by which it is moved. EMD can be efficiently computed by 
hrst applying adaptive probability binning on the cell populations and then solving 


transportation problem on the bins across the clusters 139 


In the rest this chapter, I used Mahalanobis distance to compute dissimilarity 
between a pair of clusters because it is a metric and uses the centers and covariance 
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matrices of the clusters. However, every algorithm described in this thesis works with 
other dissimilarity measures as well. 


4.3 The mixed edge cover (MEC) algorithm 
4.3.1 Overview of the algorithm 


Given a dissimilarity measure between cell cluster, I develop an algorithm to 
optimally register clusters across a pair of FC samples. I make the algorithm robust 
by allowing a cluster from one sample to be matched to zero or more clusters in 
another sample. This approach covers possible circumstances when a cell cluster in 
one sample is absent from another sample, or when a cluster in one sample splits into 
two or more cell populations in a second sample, which can happen due to biological 
reasons or due to the limitations of clustering methods. 

I characterize an FC sample by a mixture of cell clusters. Consider two FC 
samples A and B containing of ka and kb cell clusters such that A = {oi, 02 ,..., 
and B = {bi, 62 , ^ 4 }, where Oj is the cluster from A and bj is the cluster from 

B. I developed a robust variant of matching algorithm called the Mixed Edge Cover 
(MEC) algorithm that matches a cluster from A to zero or more clusters from B |^ . 
Suppose mec is a matching of clusters across A and B such that mec(ai) G 'P{B) 
and iaec(bj) G 'P{A), where V{A) (V{B)) is a subset (possibly empty) of clusters in 
A (B). When a cluster a* (or bj) remains unmatched under mec, i.e., mec(ai) = 0, 
I set d{ai, —) = X where the hxed cost A is used as a penalty for leaving a cluster 
unmatched, and is set to a value such that the number of such clusters remains 
small. I select A empirically by plotting the total number of unmatched clusters in 


all pair wise matchings of samples against A, and Ending a “knee” in the curve 37 


The cost of mec is therefore computed as the summation of the dissimilarities of all 
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pairs of matched clusters and the penalties coming from the unmatched clusters. A 


minimum-weight mixed edge cover is a mixed edge cover with the minimum cost: 



(4.3) 


Here d{ai,bj) is the dissimilarity between clusters a* and bj. 

4.3.2 Bipartite graph model 

I compute a minimum-weight mixed edge cover (MEC) algorithmically in a com¬ 
plete bipartite graph G(A, B, E) created from a pair of sample, A = {oi, 02 ,..., 0 ^^}, 
and B = { 61 , 62 , •••, 6 a;j}. In the bipartite graph G, each cluster ai ^ A represents a 
vertex in one part and each cluster bi G B from represents a vertex in the other part. 
The edge set E is created by joining each pair (a,, bj) E {A x B) of vertices with an 
edge of weight d{ai,bj). Here, d{ai,bj) denotes the dissimilarity between Oj and bj 
and is computed by using Mahalanobis distance from Equation 4.2[ 

Since low edge weight implies high similarity among vertices (clusters) I match 
vertices connected by edges with small weights and leave a vertex unmatched when 
it has no low-weight edge adjacent to it. In the above graph model, a mixed edge 
cover (MEC) is a collection of edges EC and vertices Vum such that each vertex in 
AVJB\Vum has at least one edge incident on it and Vum remains unmatched with hxed 
penalty A for each vertex. Thus, a minimum-weight MEC is an MEC that minimizes 
the following objective function: 



(4.4) 


min 


According to graph-theoretic terms, an MEC is a generalization of an edge cover 
represented by a subset of edges such that each vertex in the graph has at least one 
edge incident on it. An edge cover again is a generalization of a matching represented 
by a subset of edges such that each vertex in the graph has at most one edge incident 
on it. 
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4.3.3 MEC algorithm on the bipartite graph 


A mixed edge cover in G can be computed from an edge cover in a transformed 
graph G' obtained from G by introducing two new distinguished vertices oq G A and 
bo E B representing two dummy clusters one in each sample. In G', I add an edge 
{oo, &o} with d{ao, bo) = 0, and edges {a*, 6o} for each a* G A, {bi, Oq} for each bj G B, 
with d{ai,bo) = d{bj,ao) = A. For each such edge { 0 ^, 60 } (or {bj,ao}) included in a 
minimum-weight edge cover computed on G', I leave the Oj (bj) unmatched in a MEC 
computed on G, thereby paying a price of A for each unmatched vertex. A minimum- 
weight edge cover in G' can be computed in polynomial time by making a copy of the 
graph and connecting each vertex to its twin in the copy by an edge with weight equal 
to twice the minimum weight among original edges incident on it. A minimum-weight 
perfect matching in this graph can be used to compute a minimum-weight edge cover 


in the original graph 132 


Following the above discussion, a minimum-weight MEC in G can be computed 
in the following step: 


1 . Add dummy vertices: Create a new augmented graph G' from G by adding 
two distinguished vertices Oq G A and bo E B representing two dummy clusters 
one in each sample. In G' I add an edge { oq , & o } with d{ao, bo) = 0, and edges 
{ai,bo} for each a* G A, {bj,ao} for each bj E B, with d{ai,bo) = d{bj,ao) = A. 
Here A is the penalty for leaving a vertex unmatched. 

2 . Duplicate Graph: Let G" be a disjoint copy of G'. I create a new graph G by 

taking the union of G' and G" and adding an edge connecting every 

vertex v' in G' with its twin v” in G”. The weight of edge is set to 

2fi{v'), where /i(n^) is the minimum weight of the edges of G' incident on v'. 

3. Compute matching: Compute a minimum-weight perfect matching M in G. 

4. Compute edgecover from matching: Obtain a minimum-weight edge cover EC' 
of G' by replacing every edge {v',v"} G M by an edge of weight p^iv') in G' 
incident on v'. 
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Figure 4.1. Steps of the MEC algorithm: (a) two FC samples A and B, 
each with three cell clusters, (b) a complete bipartite graph G is created 
from A and B, (c) an augmented graph G' is obtained by adding two 
dummy clusters Oq and feg as described in step 1 of the MEC algorithm, 
(d) an identical copy of G' is combined with it to obtain (5, (e) a minimum- 
weight perfect matching M is computed in G, (f) a minimum-weight edge 
cover EG' in G' is constructed after replacing each edge of M connecting 
G' and its copy with an edge of minimum weight within G', (g) a minimum- 
weight MEC is computed from EG' by removing dummy clusters and their 
adjacent edges, and hnally (h) the clusters are matched across A and B. 


5. Compute MEC from edge cover: Remove all edges { 0 ^, 60 } and {bj,ao} from 
EC' to obtain a reduced edge cover EC in G. Put all vertices in G not covered 
by EC to the set of unmatched vertices (clusters) Vum- The resulting edge cover 
EC together with the set of unmatched vertices Vum is a solution to the mixed 
edge cover problem in G. 


I show a workflow of the MEC algorithm with two hypothetical samples A and 
B each with three clusters in Figure 4.1[ In this example, a single cell population 
is split into two clusters in sample B requiring one-to-many matching of clusters. 
Additionally a cluster in sample A remains unmatched because it does not have 
any corresponding cluster in sample B. The MEC algorithm covers both cases and 
therefore provides a robust solution to the population registration problem. 
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4.3.4 Proof of correctness 


Lemma: The Algorithm described above computes an optimum mixed edge 

cover in G. 

Proof: The correctness of the algorithm for computing the edge cover EC' in the 


graph G' is shown in 132 . I obtain a mixed edge cover in G by deleting, the vertices 


Oo and Bq and the edges incident on these vertices in EC'. Let 14m be the set of the 
vertices adjacent to oq and Bq, which will be identihed as unmatched vertices. Let 
EC be the edges remaining from the edge cover EC' in the modihed graph G \ Vum- 
I claim that EC together with 14m is an optimal solution for the mixed edge 
cover problem in G. Assume that there is an optimal solution in G consisting of 
the set of unmatched vertices 14m and an edge cover EC* in G \ 14m- Clearly 
J2{aibj)&EC = J2{aibj)&EC* Otherwise one of the solutions could 

be improved upon, thereby contradicting their minimality in G \ 14 m respectively. 
It remains to prove that there is no other MEC solution in G with a different set of 
unmatched vertices and smaller cost. Let EC together with 14m gives a MEC solution 
in G with a cost smaller than the MEC solution given by EC and 14m (Equation |4.4[ ). 
Then EC together with an edge {o, oo} or {o, Bq} for every o G 14m and the edge 
{oo, 6o} is an edge cover in G' with cost smaller that the cost of EC', contradicting 
the optimality of EC'. 


4.3.5 Complexity of the MEC algorithm 

For a graph with n vertices and m edges, a minimum-weight edge cover can be 
computed in time 0{n{n + m logn)) |132| . For a pair of sample A = {oi, 02 ,..., o,ka}j 
and B = {Bi, B 2 ,Bk^j, let k = ka + ki, he the total number of clusters. By the 
construction of the graph model n = 0{k), m = 0{k‘^). Therefore the time complexity 
of a MEC algorithm is 0{k^ log k). The number of cell clusters k is usually small (less 
than 50 in typical experiments). As such, the dissimilarity between a pair of samples 
can be computed in seconds on a desktop computer. 
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4.4 Properties of mixed edge cover 

To understand the conditions for an unmatched cluster, let bj be the nearest 
cluster in sample B from a cluster a* in sample A. Then a* is always unmatched if 
d{ai,bj) > 2A and is never unmatched if d{ai,bj) < X. If 2A < d{ai,bj) < A, a* will 
be unmatched if and only if it is not matched to a vertex in duplicate graph during 
step 3 of the algorithm. 

The edges contained in a minimum-weight MEC do not create any path of length 
greater than two. Hence, MEC never creates a cycle in the optimum solution. To 
see this, consider an MEC solution mec consisting of an edge cover EC and a set 
of unmatched vertices I4m- The graph induced by EC in C does not have a path 
of three edges since the middle edge can be removed from the optimal cover, thus 
reducing its weight further, contradicting the optimality of mec. Therefore, mec does 
not contain a path of length greater than two. In fact, mec is a union of disjoint 
“stars” (trees of height 1) where each star has no more than one vertex of degree 
greater than one, preventing cycle or long chain of matched vertices. This property 
allows a cluster from sample A to be matched to the corresponding but spitted set of 
clusters in sample B and vice versa. However, if a cell population is split up in both 
samples, the MEC algorithm is unable to match all parts across samples because this 
will require a path of length greater than 2 in the MEC solution, therefore violating 
the disjoint “stars” constructions. This limitation, however, cannot be solved by an 
algorithm that considers only the between sample dissimilarities. 


4.5 Results 

4.5.1 Registering populations across samples from two healthy subjects 


Data description: I register cell populations across two hve-dimensional samples 
collected from two healthy subjects. These two samples are part of the HD dataset 


described in Section 1.6.1 where blood was collected from Eve healthy individuals 
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Figure 4.2. Selecting the unmatch-penalty, A for the HD dataset. Each 
black curve plots the numbers of unmatched clusters, 14 m across a pair of 
HD samples for a sequence of unmatch-penalty A in the range [0,6]. The 
red curve is a smooth-average of all individual curves and represents the 
general trend for the whole dataset. From the average red curve, A = 3.5 
(shown as blue dashed line) is selected as a suitable unmatch-penalty 
because, at this value, the curve becomes stable and horizontal. 


on different days to study natural and instrumental variations among samples. For 
demonstration purpose, I randomly selected two samples from different subjects in 
the HD dataset. Each sample is preprocessed and clustered independently to identify 
four cell clusters. Figure in Chapter displays these clusters with a collection of 
two dimensional projections. 

Selecting the unmatch-penalty, A .' To determine the unmatch-penalty A for the 
HD dataset, I compute the optimum mixed edge cover (MEC) between each pair 
of samples in this dataset for different choices of A in the range [0,6]. The number 
of unmatched clusters, 14m obtained by each MEC solution is plotted against the 


corresponding A value in Figure 4.2 Each black curve in Figure 4.2 represents the 
the number of unmatched clusters for different choices of A when MEC algorithm is 
applied to a particular pair of samples. I have applied local polynomial regression 
(loess) method to smoothen the curves. The red curve is a smooth-average of all 
individual curves and represents the general trend for the whole dataset. Observe 
that an increase of A quickly decreases the size of 14m, thus allowing more clusters to 
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Figure 4.3. Cell populations are matched by the MFC algorithm across 
two representative samples in the HD dataset. In addition to showing 
individual cells with dots, I draw the 95th quantile of each cell population 
with a dashed ellipse. I show a pair of matched clusters in same color 
and join them by a line with arrowheads. Although the cell populations 
are dehned in hve dimensions, I show a 2-D projection for visualization 
purpose. 


be matched ( from Equation |4.4[ ). From the average red curve, A = 3.5 (shown as blue 
dashed line in Figure |4.2[ ) is selected as a suitable unmatch-penalty because, at this 
value, the curve becomes stable and horizontal. Since the samples are very similar 
(healthy) in this dataset, we do not expect many unmatched clusters and therefore 
A = 3.5 looks like a reasonable choice. I use A = 3.5 for any MEC computation 
between two samples in the HD dataset. 

Registering populations: I now apply the MEC algorithm, with A = 3.5, to match 
cell clusters across the selected pair of samples. For visualization purpose, I project 


the matched populations onto two dimensions (CD3 and CD19) in Figure 4.3 I 
selected CD3 and CD19 markers for projection because they cleanly dehne three 
functional cell subsets - T cells, B cells and Natural Killer (NK) cells. The MEC 
algorithm, however, does not use these biological labels, but views each cluster as 
an unlabeled distribution of cells dehned only by the distribution parameters. In 
Figure |4.3 I show a pair of matched clusters in same color and join them by a line 
with arrowheads. The matched clusters represent functionally equivalent cell subsets 
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and therefore confirm that the algorithm is actnally matching biologically eqnivalent 
cell populations across samples. Additionally the proportion of lymphocytes in each 
clusters matches closely across the samples: T cell (71.1%, 75.4%), B cells (18.4%, 
15.1%) and NK cells (10.5%, 9.5%). The proportions closely follow the lymphocyte 


proportions within a healthy immune system described by Berrington et ah 140 and 


further validate the effectiveness of the clustering and matching algorithms. Note 


that, cell clusters are actually dehned in hve dimensions (Fig. 3.3), but I show a 
simplihed 2-D projection for visualization purpose. 

In Figure |4.3 it may seem trivial to match these three clusters across samples 
simply by visual inspection. However, the clusters are dehned in hve dimensions and 
therefore it is impractical to visually match the clusters. For example, we are unable 
to distinguish helper and cytotoxic T cell in the 2-D scatter plot shown in Figure 


4.3 because we need additional set of markers (CD4 and CDS) to diherentiate them 
from one another. Furthermore, the matching algorithm uses statistical parameters of 
clusters to compute their dissimilarity, which is not easy to perceive by visualization. 
The MFC algorithm is completely unsupervised and matches clusters solely based on 
their statistical parameters without using the biological labels (e.g., T cell, B cells 
etc.) of the clusters. However, if the cluster labels are available for a sample, then 
the algorithm has the ability to label another sample based on the matching pattern, 
thus solving the cluster labeling problem. 


4.5.2 Registering populations before and after stimulating T cells 

Data description: As a second example, I use MFC algorithm to register cell pop¬ 
ulations across two four-dimensional samples collected before and after stimulating T 
cells in whole human blood. These two samples are part of the TCP dataset described 


in Section |1.6.2| where whole blood was collected from 29 subjects and stained using 
labeled antibodies against CD4, CD45RA, SLP-76, and ZAP-70 protein markers be¬ 
fore and hve minutes after stimulation with an anti-CD3 antibody. The downstream 
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Figure 4.4. Registering (matching) populations by the MFC algorithm 
across a pair of samples before and after stimulating T cells with an anti- 
CD3 antibody. Matched clusters are marked with same color and are 
joined them by a line with arrowheads. The cell populations are dehned 
in four dimensions, but I show a 3-D projection for visualization purpose. 


phosphorylation event caused by the stimulation increases the levels of SLP-76, and 


ZAP-70 proteins in different T cell subsets 44 


I randomly selected a pair of samples from the same subject before and after the 
stimulation. Each sample is clustered independently to obtain four cell clusters with 
the following expression prohles: (a) CD4+ CD45RA^°’", (b) CD4+ CD 45 RA*^^sh^ 
CD4“ CD45RA'^^®*^, and (d) CD4“ CD45RA^°’". (Recall that ‘-I-’ and ‘high’ indicate 
higher abundances of a marker, and ‘ —’ and ‘low’ indicate lower levels of it.) 

Registering populations: I select unmatch-penalty A for the TCP dataset following 


the similar procedure discussed in Section |4.5.1[ The MEC algorithm is then used to 
register cell population across the selected pair of samples. For visualization purpose, 
I show the matching solution in three-dimensional projections (CD4, CD45RA and 


SLP-76) in Fig. 4.4, where matched clusters are shown in same colors. By comparing 


the matched clusters in Fig. 4.4, we observe an increased level of SLP-76 proteins after 


the stimulation (and ZAP-70, although this is not included in the Fig.). The mapping 
of clusters across pre- and post-stimulation samples can assess the population-specihc 
effects of the stimulation experiment. 
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Figure 4.5. Registering different clusters between normal and AML sam¬ 
ples. Matched clusters are marked with same color. The proportion of 
myeloid blasts (shown in red) increases signihcantly in the AML sample. 


4.5.3 Registering population in diagnosing Acute Myeloid Leukemia (AML) 

Acute Myeloid Leukemia (AML) is often diagnosed by abnormal increase (greater 
than 20% of the white blood cells) of premature myeloid blast cells 141, 142| . In flow 
cytometry, the myeloid blasts can be identified in SS vs. CD45 plot. Here, the side 
scatter (SS) measures the granularity of cells and CD45 is a marker for the leukocytes. 
To show the hrst step of AML diagnosis, I use a normal and an AML samples from 


the AML dataset 32 . I only consider SS channel and CD45 marker for this analysis. 


I cluster the healthy and the AML samples with the k-means algorithm and obtain 
hve clusters in each sample. The hve clusters represent lymphocytes, lymphoid blasts, 
myeloid cells, myeloid blasts, and monocytes. The myeloid blasts are identihed by 
medium side scatter and medium CD45 expressions. Next, I match cell clusters in 
order to detect same cell types between the samples. The result is illustrated in 
Fig. |4.5[ where a pair of matched clusters is shown in same color. After registering 
clusters, we observe that the fraction of myeloid blasts (shown in red) increases from 
1.5% in healthy sample to 28% in the AML sample. This observation conhrms that 
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the second sample has Acute Myeloid Leukemia. The mixed edge cover algorithm is 
able to detect abnormally behaving cell populations indicative AML and therefore, 
can be used in automated diagnosis of this cancer. 

4.6 Conclusions and future work 


Population registration or labeling is a fundamental problem in flow cytometry 
since it is used to evaluate population level differences across biological conditions. 
To solve this problem computationally, I have developed a robust mixed edge cover 
(MEC) algorithm that registers high-dimensional clusters across a pair of FC samples. 
The MEC algorithm uses a robust graph-theoretic framework to match a cluster from 
a sample to zero or more clusters in another sample and thus solves the problem of 
missing or splitting clusters. I demonstrated, with three representative datasets, that 
this algorithm correctly matches phenotypically similar clusters despite heterogeneity 
across samples. Given a sample with known cluster labels, the MEC algorithm labels 
clusters from another sample, thus solving the cluster labeling problem. 

Beside matching and labeling populations, an optimal MEC solution computes 
a combinatorial dissimilarity between a pair of FC samples. This between-sample 
dissimilarity works as a building block of the meta-clustering and classihcation ap¬ 
proaches discussed in Chapter and Chapter 

MEC algorithm has three limitations. First, it does not use the proportion of 
cells in the populations while matching them across samples. It has been observed 
that samples from same biological status approximately preserve the proportion of 


cells in different populations (e.g., see the discussion in Section 4.5.1 and also by 


Pyne et ah [^). Hence, cell proportions can be used directly in Eq. 4.4 to provide a 
second level of verihcation. However, care must be taken when using cell proportion 
to match populations from different disease status. For example, the proportion 
of CDd"*" cells reduces signihcantly after HIV infection, therefore this proportion of 
CD4+ cells will not match between healthy subject and HIV patient. Second, MEC 
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algorithm can match populations more accurately if it uses the relative arrangement 
of populations within a sample. In this approach, each sample is modeled by a 
network of populations and two such networks are aligned to match populations 
across samples. This approach has been extensively used to align protein and gene 


regulatory networks 143,144 . Aligning two networks, however, is an NP-complete 
problem; therefore, approximation algorithms are often used to align large networks. 
Finally, the dissimilarity measure given by the MFC solution is not a metric since it 
fails to satisfy triangle inequality. Distance metrics are especially useful for creating 
hierarchical organization of samples, as will be discussed in next Chapter 

In current work, I am improving the MFC algorithm by addressing the limitations 
discussed in the previous paragraph. In particular, I am developing a between-sample 


distance metric similar in spirit of the earth mover’s distance 133,139 such that 
the combinatorial dissimilarity measure becomes a metric, as well as remains robust 
similar to the MFC. 
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5 META-CLUSTERS AND CLASS TEMPLATES 


5.1 Introduction 


Consider a collection of flow cytometry samples belonging to a few representa¬ 
tive classes with each class of samples indicating the same biological status (e.g., a 
disease). I describe an algorithm that builds a template for each class of samples 
by emphasizing the common properties of the class while omitting sample-specihc 


details 23,27,36 . Samples within a class usually have shared populations expressing 


similar phenotypes, whereas samples from different classes contain some unrelated 
populations. Cell populations with similar phenotypes in different samples of a class 
can be combined into a meta-population (also called a meta-cluster) representing 
a generic phenotype shared by the populations. A set of relatively homogeneous 
meta-clusters forms the core pattern of a class of samples and groups together into a 
prototypic template. I summarize the concepts of meta-cluster and template in Table 
5.1 and in Figure 5.1[ a). 

I have developed a hierarchical matching-and-merging (HM&M) algorithm for con¬ 
structing templates from a collection of samples. The algorithm repeatedly merges 
the least dissimilar (most similar) pair of samples not already included in a template. 
The dissimilarity between a pair of samples is computed by the cost of an optimal 
mixed edge cover (MEC) solution discussed in Chapter Towards this end, I as¬ 
sume that samples belonging to a particular class have smaller dissimilarity among 
themselves than samples from other classes. The HM&M algorithm builds a binary 
template-tree representing the hierarchical relationships among the samples. A leaf 
node of the template-tree represents a sample and an internal (non-leaf) node rep¬ 
resents a template created from the samples. Figure |5.lK b) shows an example of a 
template-tree created from four hypothetical samples. Si, S 2 , S 3 , and S'4. An internal 
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Table 5.1 

Summary of concepts - cell population, sample, meta-cluster, and tem¬ 
plate - used in this chapter. 


Terms 

Meaning 

Cell population 

(cell cluster) 

a group of cells expressing similar features, e.g., helper T cells, 

B cells, etc. 

Sample 

a collection of cell populations within a single biological sample 

Meta-cluster 

a set of biologically similar cell clusters from different samples 

Template 

a collection of meta-clusters from samples of same class 


node in the template-tree is created by matching similar cell clusters across the left 
and right children and merging the matched clusters into meta-clusters. Fig. 5.1[ c) 
illustrates the creations of the internal node T^S^, S 4 ) in Fig. |5.lK b) by matching clus¬ 
ters across samples S'3 and 54 . The root of the template-tree dehnes a class-template 
when all samples in the leaf nodes belong to the same biological class, and otherwise, 
the tree can be cut at an appropriate height to discover multiple class templates. 

Just as a sample is characterized by a mixture of cell populations, a template 
is characterized by a hnite mixture of homogeneous meta-clusters. The homogene¬ 
ity of meta-clusters needs to be evaluated statistically since there is no accepted 
method for evaluating the biological homogeneity of clusters. However, conventional 
methods for statistical signihcance testing (e.g., F-test or paired t-test) have a high 
probability of making a Type I error when used to evaluate the homogeneity of a 


meta-cluster because clusters in an FC sample are typically large 145,146 . Instead, 
the ratio of between-cluster to within-cluster variances, within a multivariate analysis 
of variance (MANOVA) model, can evaluate the homogeneity of a meta-cluster more 
effectively |147| . Based on a baseline experiment with a healthy dataset, I discuss a 
tolerable limit for the between-cluster variance in a homogeneous meta-cluster. 


The ability to systematically characterize a class of multi-dimensional samples 
with a well-defined template is useful in various applications of flow cytometry. In 
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(a) Terminology 

A meta-cluster 
A cluster 

A template 



A cluster 
(population) 

Acell — ■© 


A sample 


(b) Template tree 



Si S2 S3 S4 


(c) One phase of the 
HM&M algorithm 



• - • 


S3 S4 


Matched clusters 
are merged to get 
meta-clusters 


Cluster matching 
by MEC algorithm 


Figure 5.1. (a) The concepts of sample, cluster, template, and meta¬ 

cluster are illustrated graphically. Cells are denoted with dots, clusters 
with solid ellipses and meta-clusters with dashed ellipses, (b) An example 
of a hierarchical template-tree created from four hypothetical samples 
S'i,S' 2 ,S '3 and 5 * 4 . A leaf node of the template-tree represents a sample 
and an internal (non-leaf) node represents a template created from its 
children in the tree, (c) One step of the HM&M algorithm creating a 
template T^S^, S^) from a pair of samples S 3 and S'4. The algorithm hrst 
matches clusters across S 3 and S '4 by the mixed edge cover algorithm and 
then merges the matched clusters to construct new meta-clusters. 


addition to providing a cogent description of the core population structure that is 
shared among samples within a class, the templates also allow an objective way of 
assessing the overall differences in those structures across classes. I demonstrate the 
application of templates with a healthy donor (HD) dataset and a T cell phospho¬ 
rylation (TCP) dataset. For the HD dataset, the algorithm is able to construct hve 
well-separated templates for hve healthy subjects, despite different sources of within- 
subject variations. For the TCP dataset, the algorithm creates pre- and a post¬ 
stimulation templates from 29 pairs of samples collected before and after stimulating 
blood cells with an anti-CD3 antibody. By matching meta-clusters across templates, 
we can better assess the population-specihc effects of the stimulation experiment. 

Related work: Several researchers have used the concepts of templates and meta¬ 


clusters to summarize collection of how cytometry samples 23,27,39. FLAME, 
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proposed by Pyne et al. pools cluster medoids from a class of samples and 
applies a second stage of clustering on the medoids to construct the global meta¬ 


clusters. flowTrans, developed by Finak et al. 23 , starts with seed meta-clusters 
and assigns each cluster to a meta-cluster to which it is most similar. The HM&M 
algorithm is signihcantly different from both FLAME and flowTrans in several ways. 
First, FLAME and flowTrans both build a single template from samples of the same 
class. Therefore, these algorithms need to know the class label of each sample, which 
is often unknown in practice. In contrast, the HM&M algorithm is able to identify 
templates as the roots of well-separated branches of a template-tree in an unsupervised 
manner. This approach also permits multiple templates representing different sub¬ 
classes within a single class, and therefore, is more flexible to capture sample diversity. 
Second, instead of clustering population centers, the HM&M algorithm optimally 
matches populations across samples and then merges the matched clusters into meta¬ 
clusters (see Figure 5.1[ c) for an example). Like FLAME, but unlike flowTrans, 
this algorithm allows a cluster forming a self-contained meta-cluster by itself when 
the cluster is signihcantly different from all other clusters. Finally, the hierarchical 
organization of samples shown in Fig. |5.1[ b) provides additional hexibility in creating 
multi-layer templates, classifying samples, and updating templates dynamically. 

The rest of the chapter is organized as follows. In Section [T2, I describe a hi¬ 


erarchical algorithm for creating templates from a collection of samples. Section 5.3 


discusses a statistical framework for analyzing the homogeneity of meta-clusters. I 
demonstrate the application of templates with two representative FC datasets in Sec¬ 


tion 5.4 I hnish this chapter in Section 5.5 with concluding remarks. 


5.2 An algorithm for constructing templates 

5.2.1 Dissimilarity between samples or templates 

The template construction algorithm characterizes both a sample and a template 
with a hnite mixture of multivariate normal distributions each component of which 
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is a cluster or a meta-cluster. To compute the dissimilarity between a pair of samples 
or templates, I create a complete bipartite graph with the clusters (meta-clusters) in 
each sample (template) as vertices, and each edge is weighted by the Mahalanobis 
distance between its endpoints. The cost of an optimal mixed edge cover (MEC) in 
the bipartite graph gives the dissimilarity between the samples (templates). This dis¬ 
similarity measure is similar in spirit with the R-metric, transfer distance or partition 
distance that computes the minimum number of augmentations and removals of cells 


needed to transform one partition into another 121-123 . However, the partition 


distance can only compare two partitions of the same sample, whereas MEC based 
dissimilarity can work with partitions from the same sample, or from two different 
samples. Partition distance matches a cluster to at most one cluster, while MEC is 
able to match a cluster to zero or more clusters. Therefore, the latter approach is 
more robust in the presence of missing or splitting cell populations. 


5.2.2 The hierarchical matching-and-merging (HM&M) algorithm 

Consider a collection of N flow cytometry samples {Si, S 2 , ■■■, Sn}- Each sample 
is clustered independently to identify phenotypically distinct cell populations. The 
HM&M algorithm organizes the samples into a template-tree data structure as was 


described in Fig. 5.1 Each node Vi in the template-tree represents either a sample 
(leaf node) or a template (internal node). Let Vi consist of ki clusters or meta-clusters, 
Vi = {c\, C 2 , ..., c\.}. Vi is called an “orphan” if it does not have a parent in the 
template-tree. At each step, the HM&M algorithm identihes the most similar pair of 
orphan nodes and merges them into a new node denoting a template of the merged 
nodes. This process continues until a single orphan node (root) remains. The HM&M 
algorithm can be described with the following three steps. 

1. Initialization: Create a node Vi for each of the N samples Si. Dehne the set of 
orphan nodes. Orphan = {^ 1 ,^ 2 , Repeat the following matching and merging 

steps until a single orphan node remains. 
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2 . Matching: Compute the dissimilarity D{vi,Vj) between every pair of nodes Vi 
and Vj in the current Orphan set by the cost of an optimal mixed edge cover solution. 

3. Merging: Find a pair of orphan nodes {vi,Vj) with minimum dissimilarity 
D{vi,Vj) and merge them to create a new node vi. Let mec be a function denoting 
the mapping of clusters from Vi to Vj. That is, if G Vi is matched to G Uj, 
then = mec(c^), where 1 < x < ki and 1 < y < kj. Create a new meta-cluster c[ 
from each set of matched clusters, i.e., c[ = {cl Umec(c^)}. Let ki be the number 
of new meta-clusters created above. Then the new node vi is created as a collection 
of these newly created meta-clusters, i.e., vi = {c[,cl, ...,cl^}. The distribution pa¬ 
rameters, (/i^, S^), of each of the newly formed meta-clusters c{ are estimated by the 
Expectation-Maximization (EM) algorithm. The height of vi is set to D{vi,Vj). The 
node vi becomes the parent of Vi and Vj, and the set of orphan nodes is updated to 
orphan = orphan \ {vi,Vj} U {n^}. If there are orphan nodes remaining, we return to 
the matching step, and otherwise, we terminate. 

5.2.3 Creating templates from a template-tree 

Let the samples {S'!, S 2 , ..., Sn} belong to M disjoint classes where M may or may 
not be known. Then, the following three cases are considered to create templates from 
the template-tree. 

(a) Class label of each sample is known: A template-forest with M disjoint trees is 
constructed where each tree includes samples belonging to the same class. The roots 
of the M trees represent the templates of the M classes of samples. 

(b) M is known but the class label of each sample is unknown: A single template- 
tree is created from N samples. Then, the tree is cut at a suitable height so that M 
disjoint subtrees are produced. The root of each subtree represents a template of the 
samples placed in the leaves of that subtree. 

(c) Both M and the class labels of the samples are unknown: A single template- 
tree is created from N samples. The templates are created from the roots of the well- 
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separated branches of the template-tree such that within-class variations (heights of 
the subtrees) are small relative to the between-class variations (heights of the ances¬ 
tors of the subtrees). In this context, the HM&M algorithm is similar to the spirit 


of the hierarchical clustering algorithm UPGMA 33,95 , with signihcant differences 
in the distance computation and management of the internal nodes. Assume that 
M well-separated templates are produced from the template tree. Then, N sam¬ 
ples are predicted to be generated from these M classes. This approach leads to an 
unsupervised classihcation of samples discussed in the next chapter. 


5.2.4 Computational complexity 

The HM&M algorithm requires 0{N‘^) dissimilarity computations and 0{N) 
merge operations for creating a template from a collection of N samples. Let k 
be the maximum number of clusters or meta-clusters in any of the nodes of the 
template-tree. Then a dissimilarity computation takes 0{k^ log k) time whereas the 
merge operation takes 0 {k) time when distribution parameters of the meta-clusters 
are computed by maximum likelihood estimation. Hence, the time complexity of the 
algorithm is 0{N‘^k^ log k), which reduces to 0{N‘^) for bounded k. 

5.3 The homogeneity of meta-clusters and templates 

A template summarizes the common features of a group of similar samples. These 
common features are captured by a collection of meta-clusters (meta-populations) 
formed by combining cell clusters expressing similar phenotypes in different samples. 
If a template correctly represents a collection of samples with the same biological 
status, each meta-cluster would contain a homogeneous collection of cell clusters with 
similar phenotypes. However, there is no standard method for computing biological 
homogeneity of a collection of cell populations. Therefore, I consider a statistical 
procedure, in a Multivariate Analysis of Variance (MANOVA) model, to evaluate 





85 


the homogeneity of meta-clusters. For simplicity, I hrst discuss the homogeneity of 
one-dimensional meta-clusters, followed by the multi-dimensional case. 


5.3.1 Analyzing homogeneity of a one-dimensional meta-cluster 

Let C be a meta-cluster consisting of k one-dimensional (1-D) clusters Ci,C 2 , ..Ck 
dehned within a single marker/channel. A 1-D cluster usually represents a distri¬ 
bution of cells with a marker either being expressed or not being expressed. For 
example, CDS’*" and CD3“ represent two 1-D clusters in CD3 channel, with the for¬ 
mer expressing high levels of CD3 marker and the latter expressing low levels of it. 
CD3 is a marker of T lymphocytes, and therefore, a meta-cluster denoting T cells is 
biologically homogeneous if it contains only CDd"*" cell clusters. 

Let the cluster Ci & C contain rii cells and be normally distributed with mean fii 
and variance cxf. The normality assumption was justihed in Chapter]^ where I showed 
that cell clusters usually follow normal distribution after stabilizing their variances. 
In spite of variance stabilization, I still use af to denote the variance of the cluster 
(instead of using a common variance) to emphasize that the variances can not be 
made exactly equal by the variance stabilization process. The mean and variance 
of each cluster are estimated by the E-M algorithm (alternatively, with an unbiased 
maximum likelihood estimator) after the clusters are identihed as was discussed in 
Chapter 

Let the entire meta-cluster C contain n cells in total with mean jj,. The separation 
of a cluster Cj from the meta-cluster C can be approximated by the squared deviation 
of their centers (/i* — /i)^. In behavioral and biological sciences, (/i* — /i)^ is generally 


called the “effect-size” because it indicates the effect of the treatment 147 . In 


the meta-clustering context, it denotes the “cluster separation” of the cluster from 
the meta-cluster center. The between-cluster variation of a meta-cluster is then 
computed with the average separation of the clusters: 

h 

O 1 


U. = 


n — k 


^{rii - l)(/ii -/i)l 


2 = 1 


(5.1) 
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The within-cluster variance, is estimated by a weighted average of all individual 
cluster variances (pooled variance): 

, k 


2 _ 

^ n-k 


—t E -1) 




(5.2) 


i=l 


I calculate the “Relative Cluster Separation” (0) after dividing al by cx^ and taking 
a square root of the ratio: 

0=^. (5.3) 

Note that the square of the above ratio, 0^, is also known as the signal-to-noise ratio 
(SNR) or Cohen’s p statistics |147| . 0 is a dimensionless number and can take 
values between zero (when the clusters collapse into a single point) and an indefinitely 
large number (when the clusters become further apart relative to a^). For example, 
0 = 0.1 conveys that the average separation among clusters (ai,) is one-tenth of 
their (pooled) standard deviation (cx^), indicating a signihcant overlap (homogeneity) 
among the clusters. In contrasts, 0 = 2 conveys a relatively inhomogeneous group of 
clusters because the average between-cluster separation is twice as large as the within- 
cluster standard deviation. A meta-cluster is considered relatively homogeneous when 
0 is small relative to a predefined threshold. For FC data, this threshold can be set to 
one, i.e., a collection of clusters is relatively homogeneous when 0 < 1 , and otherwise. 


inhomogeneous. I justify this choice of threshold in Section 5.4.1 with a healthy 
dataset. A guideline, although controversial, for setting the threshold at three levels 
- .1 for small, .25 for medium and, .4 for large effect - is suggested by Cohen for 


behavioral studies 147 . However, I found these levels too small for relatively large 
cell clusters observed in FC samples. 

I compute the uncertainty associated with 0 by computing the confidence interval 
around it. Assume, for simplicity, that the clusters within a meta-cluster have equal 
sizes n' ( here, n' = n/k^ where n is the total size of the meta-cluster and k is the 
number of cluster in the meta-cluster). Then the conhdence interval for 0 is calculated 
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from the confidence interval of the non-central F distribution whose non-centrality 
parameter A is calculated with: 

A = ^ = (5.4) 

(7‘^ 

^ w 

Let [Xl, A;/] be the 100(1 —a)% conhdence interval for A, then the following equations 
hold: 


p[Xl < a < Xu] = 1 


a 


P 









n 


= 1 — a. 


(5.5) 

(5.6) 


Hence, the conhdence interval [0/., for 0 is calculated with 


146 


149 


Note that I employ 0 to evaluate the homogeneity of a meta-cluster obtained from 
any meta-clustering algorithm. However, a meta-clustering algorithm can directly 
minimize 0 and obtains the “optimum” homogeneity of meta-clusters. This approach 
is very similar to Fisher’s linear discriminant analysis that maximizes the separation 
between clusters relative to within-cluster variations [^. I leave this optimization 
approach as a future work and will not discuss it any further. 


5.3.2 Comparison with null hypothesis based signihcance testing 

Consider a meta-cluster consisting of k cell clusters, where the 0^ cluster contains 
Hi cell with mean /ij. To evaluate the homogeneity of this meta-cluster in an Analysis 
of Variance (ANOVA) model, we can consider a null hypothesis of no difference among 
cluster centers: 

Hq pi = p2 = ■ ■ ■ = Pk- ( 5 - 7 ) 

The normality and homoskedasticity assumptions required for the ANOVA model are 
approximately satished by the variance stabilization step discussed in Chapter The 
validity of the above null hypothesis can be tested by an F-test that measures the 
following F-statistics: 

k 

l^Y.ni{pi- pf 

p . =_ 


(5.8) 











where, n is the mean of the entire meta-cluster and cr^ is the pooled variance of 


the clusters as defined in Eq. 5.2 The p-value of the F-test is computed with the 
probability of obtaining a test statistic at least as extreme as Fobs assuming that the 
null hypothesis is true, i.e., we compute P{F > Fobs), where F is a random variable 
following F-distribution with k — 1 and n — k degrees of freedom. In a meta-cluster 
analysis, p-value evaluates whether the difference among cluster locations can occur 
by chance. The null hypothesis FIq is rejected when the p-value is less than a certain 
signihcance level (often 0.05 or 0.01). Hence, we can declare a meta-cluster to be 
inhomogeneous {FIq is rejected) when the p-value of the F-test is less the significance 
level, and otherwise, declare it to be homogeneous {Hq in not rejected). 

The F-test depends on the size of clusters because the cluster size (uj) appears on 


the numerator of Fobs in Eq- 5.8 In the Null Hypothesis based Signihcance Testing 
(NHST) framework, large clusters produce high Fobs (low p-value of the F-test) and 
therefore, tend to declare a meta-cluster to be inhomogeneous even though the average 
cluster-separation is relatively small. As a matter of fact, given sufficiently large 
cluster sizes, an F-test always rejects Hq unless the locations of the clusters collapse 
at a single point, which is extremely unlikely in practice. In how cytometry, the 
number of cells in a cluster is typically large, often in the range of tens of thousands 
cells. Hence, NHST has a high probability of making a Type I error, even at a very low 
alpha such as at .001, and incorrectly declares a meta-cluster inhomogeneous when 
the cluster sizes are large. In contrast to the F-test, the relative cluster separation (0) 
does not depend on the cluster sizes and is therefore more applicable to meta-cluster 
analysis in how cytometry. 

To see the relationship between 0 and the paired t-test, consider a meta-cluster 
consisting of two equal-size [n') clusters, with means pi and /i 2 and a common variance 
cr^. In this case, Eq. |5.3| simplihes to 

, _ 11/^1 — /^21 _ d 


a 


(5.9) 
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where d is the standardized mean difference between two clusters, also known as 


Cohen’s d 147 . In this settings, a two-group t-test statistics (tabs) is computed by 

(5.10) 


_ n' III -112 
^obs — \l ^ ■ 


a 


Similar to Fobs in Eq- 1^. tabs depends on the cluster sizes. Consequently, the t-test 
can also report a small difference to be highly significant when the cluster sizes are 
large relative to the within-cluster variance. Therefore, 0 can also be used to assess 
the separation of a pair of clusters instead of traditional t-test. 


5.3.3 Analyzing homogeneity of high-dimensional meta-clusters 


I analyze the homogeneity of high-dimensional meta-clusters in a fixed-effect Mul¬ 


tivariate Analysis of Variance (MANOVA) model 150 . As before, let C be a p- 


dimensional meta-cluster consisting of k clusters, ci,C 2 ,..Cfe, with the cluster q 
containing n* cells and the size of the entire meta-cluster to be n = Y2i=i 
sume that the cluster c* is modeled by a multivariate normal distribution with a 
p-dimensional mean vector and apxp covariance matrix Sj. The mean vector ^ of 
the entire meta-cluster C is estimated by ^ Similar to the one-dimensional 

setting, I compute the within-cluster covariance matrix, and the between-cluster 
covariance matrix, Ylb, as follows: 


— 


_ h 


n — k 


2 = 1 


E, = 


1 


n — k 




(5.11) 


(5.12) 


2=1 


The relative cluster separation, 0, in multidimensional setting is then computed with: 

0 = y^trace(S-iSfe). (5.13) 


In MANOVA, trace(S“^S6) is known as the Lawley-Hotelling statistics 150 152 
and is computed by the summation of the eigenvalues of I divide the Lawley- 

Hotelling statistics by p so that 0 becomes independent of the dimensionality and 
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take a square root of it to keep it in the original unit of measurement. Similar to the 
univariate case, 0 < 1 denotes that the between-cluster variation is less than naturally 
occurring within-cluster variation, and therefore, the meta-cluster can be considered 


homogeneous. As pointed out by Ito et ah 153, p. 75-78], 0 is a robust estimator of 
the homogeneity of meta-clusters even in the presence of limited heteroskedasticity 
of covariances in large clusters. Therefore, this measure can be applied to analyze 
homogeneity of heteroskedastic clusters as well. 

In the multi-dimensional setting, computing an exact conhdence interval around 
0 is difficult because the exact distribution of the Lawley-Hotelling statistics is not 


known for general k and p. According to the discussion in 152,154,155 , I ap- 


proximatethe distribution with a non-central high-dimensional F distribution with 
degrees of freedom a and b and the non-centrality parameter. A, where a = p{k — 1), 
b = pin — k — p — 1) + 2, and A = 0^(a -f 5 -|- 1). The conhdence interval [0i,, 0;/] 
around 0 is therefore computed with where [Xl,^u] is the conh¬ 

dence interval of the non-centrality parameter, A. I calculate the conhdence interval 

. Note that = (n — /c)trace(S“^S6) is 


for A by using the R package MBESS 


156 


Hotelling’s generalized statistics 150-152 and is equivalent to the F statistics 


m 


the univariate case. Similar to the univariate case, also depends on cluster sizes 
and has a high probability of making a Type I error by declaring trivial diherences 
as statistically signihcant when cluster sizes are large. 


5.4 Results 

5.4.1 Creating templates from healthy samples 

Data description: The healthy donor (HD) dataset consists of 65 samples from hve 
healthy individuals who donated blood on diherent days. Each sample was divided 
into hve replicates and each replicate was stained using labeled antibodies against 


CD45, CDS, CD4, CDS, and CD19 protein markers. Section 1.6.1 provides a detail 
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(a) Template tree 



(b) Meta-clusters 
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Figure 5.2. (a) The template-tree created by the HM&M algorithm from 
all samples of the HD dataset. Leaves of the dendrogram denote samples 
from hve healthy individuals. An internal node represents a template and 
the height of an internal node measures the dissimilarity between its left 
and right children. The sample-specific subtrees are drawn in different 
colors, (b) Bivariate projections of the combined healthy template (the 
root of the tree in Subfig. (a)) are drawn in terms of the meta-clusters. 
Here, each meta-cluster is represented by a homogeneous collection of cell 
clusters that are drawn with the 95th quantile contour lines. Clusters 
participating in a meta-cluster are drawn in same color. 


description about this dataset. Each sample of the HD dataset was preprocessed, 


variance stabilized, and clustered to identify cell populations (see Fig. 3.3). 

Creating healthy templates: Figure m a) shows the template-tree created from 
the HD dataset. A leaf node of the tree denotes a healthy sample, and an internal 
node represents a template created from samples placed in the underlying subtree. 
The height of an internal node measures the dissimilarity between its left and right 
children. We observe that 65 samples from the five healthy subjects are organized 
into five well-separated branches shown in five different colors in Fig. 5.2[ a). From 
the roots of these five subtrees, we can construct five subject-specific templates, e.g., 
Ta represents a template created from 15 samples from subject A . In this dataset. 
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the variations within a subject-specihc template arise from the environmental impact 
on individual immune system on different days and the technical variation in flow 
cytometry sample preparation and measurement. By contrast, the between-subject 
variations arise from the natural biological variations among healthy subjects. In this 
dataset, we observe more natural between-subject variations than the temporal and 
instrumental variations. Hence, samples from the hve subjects create concise and well 
separated templates representing immune prohles for different healthy individuals. 

The HD dataset includes three sources of variations - the technical, day-to-day 
within-subject, and between-subject variations. The template-tree captures these 
multi-level variations with different internal nodes that can be used as multi-layer 
templates. In the lower level, a day specihc template is constructed from hve replicates 
of a sample collected on a particular day from a subject. In the middle level, samples 
from a subject collected on different days create a subject-specihc template (the roots 
of colored subtrees in Fig. 5.2[ a)). Finally, in the top level, the root of the whole tree 
represents a combined template representing a healthy immune prohle of these hve 
subjects. This multi-level construction is especially useful to design a multi-level 
sample classihcation framework, which is the topic of the next chapter. 

Meta-clusters in the healthy template: The template created from the HD dataset 
(the root of the tree in Fig. |5.2[a)) represents a healthy immune prohle of the hve 


subjects. This template consists of four meta-clusters denoting four biologically dis¬ 
tinct cell sub-types within lymphocytes (we have isolated lymphocytes on the scat¬ 
ter channels in the pre-processing steps). Each meta-cluster is a collection of bi¬ 


ologically equivalent cell populations from diherent samples. Figure 5.2 b) shows 
2-D projections of these meta-clusters in terms of the participating clusters. The 
95th quantiles of the clusters within each meta-cluster are shown in same color 
denoting (1) green: CDS’*" T cells (CD45+CD3’''CD8’''), (2) black: CD4+ T cells 
(CD45+CD3+CD4+), (3) blue: B cells (CD45+CD3-CD19+), and (4) red: natural 
killer cells (CD45’''CD3“CD19“). These meta-clusters represents core population 
pattern of healthy lymphocytes excluding the subject-specihc variations. 
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Figure 5.3. The homogeneity (Relative Cluster Separation, 0) of four 
healthy meta-clusters shown in different panels. The meta-clusters are 
created from three different groupings of the healthy samples shown in 
different colors. 


Homogeneity of meta-clusters: How homogeneous are the meta-clusters contained 
in different templates of the HD dataset? Here, I discuss the homogeneity of three 
levels of templates: (a) the healthy template created from all 65 samples, (b) 5 subject- 
specific templates, each of which is created from samples from a subject, and (c) 
13 day-specific templates, each of which is created from five replicates of a sample 
collected on the same day from a subject. Every template consists of four meta¬ 


clusters similar to the healthy template described in Fig. 5.2 b). We can evaluate the 


homogeneity of the high-dimensional meta-clusters in each template with the Relative 


Cluster Separation (RCS), cf, as discussed in Section 5.3.3 


Fig. 5.3 shows the RCS (0) of four meta-clusters (different panels) denoting four 
subtypes of cells across three groups of templates (different colors). For the subject- 
specific and day-specific templates, I computed the the average RCS of the meta¬ 
clusters in each group of templates. Recall that a high value of 0 indicates low 
homogeneity (high inhomogeneity) of a meta-cluster and vice versa. We observe, in 
each panel of Fig. 5.3| that a meta-cluster becomes increasingly inhomogeneous (i.e., 
0 decreases) as we increase variations from different sources. This is expected because 
variations in data (even though from natural sources) reduce the uniformity of the cell 
populations. However, we do not expect any biologically significant variation within 
a meta-cluster (denoting a cell-type) because the observed variations are originated 
from natural sources among healthy individual. The highest value of 0 is less than 
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one for any meta-cluster in Fig. 5.3 Therefore, we consider a meta-cluster to be 
biologically homogeneous when 0 < 1. That is, a meta-cluster is homogeneous when 
the between-cluster variation is less then the pooled within-cluster variation. This 
assertion is satished by the healthy meta-clusters, as expected, because all samples 
are from the same biological status. 

The threshold for 0 can be adjusted for a biological experiment depending on the 
natural variability expected in the experimental condition. Therefore, it is advisable 
to run a pilot study similar to the HD dataset to determine baseline/trivial homogene¬ 
ity measurements, which can subsequently be used to assess biologically signihcant 
variations across different classes of samples. 


5.4.2 Creating templates from the TCP dataset 


Data description: T cell phosphorylation (TCP) dataset was originally generated 


by Maier et al. 44 to study the effect of phosphorylation upon stimulating blood 


with anti-CD3 antibody. The abundance of four protein markers (CD4, CD45RA, 
SLP-76, and ZAP-70) was measured before and hve minutes after stimulating whole 


blood with an anti-CD3 antibody 44 . I reanalyzed 29 pairs of samples of this dataset 


to demonstrated the creation of templates and the detection of general effect of stimu¬ 


lation at the meta-cluster level. Section [1. 6. 2| provides a detail description about this 
dataset. Each of the 58 samples in the TCP dataset was preprocessed and clustered 
independently to identify cell populations (Section [3.5.2 ). 

General effects of stimulation on phosphorylation responses: During the stimu¬ 
lation, anti-CD3 antibody binds with T cell receptors (TCR) and activates the T 
cells, initiating the adaptive immune response. The binding with TCR induces dra¬ 
matic changes in gene expression and cell morphology, and induces the formation of 
a phosphorylation-dependent signaling network via multi-protein complexes. ZAP-70 
is a kinase that phosphorylates tyrosine in a trans-membrane protein called LAT, and 
LAT and SLP-76 are part of a platform that assembles the signaling proteins (5^. 
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Previous studies have shown that different T cell subsets (naive, memory, effec- 

In 


tor) display different phosphorylation responses upon stimulation 44,157-159 


these studies, each sample was gated to identify cell populations of interest, and 
each pair of samples - before and after stimulation - were compared to detect the 


phosphorylation responses. However, Maier et ah 44 reported that the autoimmune 
disease-associated allele at CTLA4 gene on chromosome 2q33 alters phosphorylation 
responses in naive and memory T cells. Thus, depending on their genetic prohles, 
different subjects might display different phosphorylation responses upon stimulation. 
It is therefore challenging to summarize the general effect of phosphorylation from 
observing the stimulation effect in individual samples. An alternative and robust 
approach is to create a pre-stimulation and a post-stimulation template. By match¬ 
ing meta-clusters across these templates we can better assess the population-specihc 
effects of the stimulation experiment. 

Creating pre- and post-stimulation templates: By applying the HM&M algo¬ 
rithm, I create a pre-stimulation template from 29 samples before stimulation and 
a post-stimulation template from 29 samples after the stimulation. Both tem¬ 
plates consist of four meta-clusters denoting: (a) CD4’''CD45RA^°"' memory T 
cells, (b) CD4+CD45RA^*®^ naive T cells, (c) CD4“CD45RA^*®^ cells, and (d) 


CD4“CD45RA^°"' cells. Fig. 5.4.2 shows three-dimensional projections of these meta¬ 
clusters: pre-stimulation meta-clusters in blue and post-stimulation meta-clusters in 
red. We observe an increased levels of SLP-76 protein in every meta-cluster after 
stimulation than its pre-stimulation state. However, the increment is more clear in 


naive and memory T cells, as was observed in earlier publications 44,157-159 . Sim¬ 
ilar phosphorylation effect, although smaller than SLP-76, was observed for ZAP-70 
protein as well. Therefore, two templates can concisely represent stimulation states of 
58 samples in terms of the generic meta-clusters, and comparing meta-clusters across 
templates evaluates the overall phosphorylation shifts across conditions. 

Homogeneity of meta-elusters: How homogeneous are the pre- and post¬ 
stimulation meta-clusters? To answer the question more rigorously, I create a com- 
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(a) CD4+ CD45RA*°'" memory T cells 


(b) CD4+ CD45RA''*9^ naive T cells 



0.0 0.0 



(c) CD4- CD45RA'‘*9^ cells 


(d) CD4- CD45RA'°“ cells 


Figure 5.4. Three-dimensional projections of the meta-clusters created 
from the TCP dataset: Pre- and post-stimulation meta-clusters are shown 
in blue and red, respectively. The four meta-clusters represent: (a) 
CD4+CD45RA^°"' memory T cells, (b) CD4+CD45RA^*®^ naive T cells, 
(c) CD4-CD45RA'^*^?'* cells, and (d) CD4-CD45RA'°“' cells. 


bined template from all 58 samples in the TCP dataset. Table 5.2 shows the Rela¬ 
tive Cluster Separation (RCS, 0) with the 95% conhdence intervals for every four¬ 
dimensional meta-cluster from three templates (pre-stim, post-stim, and combined). 


The combined meta-clusters in Table 5.2 are less homogeneous (i.e., larger values of 


0) than the pre- and post-stimulation meta-clusters. As expected, the homogeneity of 
meta-clusters decreases with the increase of variations in the TCP dataset, as was also 
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Table 5.2 

Relative Cluster Separation (0) with 95% confidence intervals (Cl) for 
each of the four meta-clusters in the TCP dataset. Meta-clusters are 
computed from (a) all 58 samples across before and after stimulation, (b) 
29 samples before stimulation, and (c) 29 samples after stimulation. 


Meta-cluster 


0 [95% Cl] 



Before stimulation 

After stimulation 

Combined 

CD4+CD45RA'°’" 

CD4+CD45RA'**s'^ 

CD4-CD45RA'°’" 

CD4-CD45RA'**s'^ 

0.541 [0.537, 0.545] 

0.647 [0.642, 0.651] 

0.328 [0.324, 0.332] 

0.565 [0.561, 0.568] 

0.495 [0.491, 0.499] 

0.545 [0.540, 0.550] 
0.291 [0.287, 0.294] 

0.364 [0.360, 0.367] 

1.023 [1.018, 1.026] 

1.056 [1.051, 1.060] 

0.476 [0.472, 0.478] 

0.400 [0.397, 0.403] 


observed in the HD dataset. In particular, the homogeneity of meta-clusters denoting 
the naive and memory T cells decreases by a factor of two (twofold increment of 0) 
in the combined template. For these two meta-clusters, the values of RCS are greater 
than one (0 > 1) in the combined template indicating higher between-cluster variance 
than the within-cluster variance. Therefore, the combined template is biologically in¬ 
homogeneous based on the homogeneity threshold (0 = 1) discussed earlier for the 
HD dataset. The inhomogeneity in the combined template implies that the combined 
template is created from a heterogeneous collection of samples from more than one 
biological classes, and one should not build a single template from these collection of 
samples. Therefore, the RCS statistics provides an objective way of constructing and 
evaluating templates and prevents us from building templates from inhomogeneous 
collection of samples from multiple biologically distinct classes (such as the pre- and 
post-stimulation samples in the TCP dataset). 

5.5 Conclusions and future work 

Templates and meta-clusters can succinctly summarize a large collection of sam¬ 
ples belonging to a few biological classes. With a collection of homogeneous meta- 
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clusters, a template provides a concise description of a biological condition and reduces 
the complexity of data by emphasizing the key characteristics of a class while masking 
statistical noises and low-level details. I discuss a hierarchical matching-and-merging 
(HM&M) algorithm for constructing templates that describe distinct biological states 
available in a collection of samples. The HM&M algorithm can operate both in su¬ 
pervised (class labels of samples are known) and unsupervised settings (class labels 
of samples are not known). Templates created in an unsupervised setting can lead to 
a robust classihcation scheme, as will be discussed in the next chapter. 

A meta-cluster within a template performs as a blueprint for a particular type 
of cells and represents the generic cell population with a statistical description while 
omitting sample-specihc variations. Meta-clusters can be employed for measuring 
general effect on cell populations as biological conditions changes across templates. 
As an example, the memory and naive T cells upon stimulated by anti-CD3 antibody 
display increased levels of SLP76 and ZAP70 proteins, an indication of increased 
phosphorylation in these T cell subpopulations. Despite sample-specihc variations, 
the T cell subpopulations from 29 pre-stimulation samples can be encapsulated with a 
set of homogeneous meta-clusters. Therefore, we can evaluate the population-specihc 
effect of the stimulation by registering these meta-clusters with their corresponding 
meta-clusters in a post-stimulation template. 

I discuss a statistical framework, in an multivariate analysis of variance 
(MANOVA) model, for evaluating the homogeneity of meta-clusters. For a collec¬ 
tion of variance-stabilized clusters within a meta-cluster, this approach computes the 
ratio of between-cluster to within-cluster variance and uses the ratio to evaluate the 
homogeneity of the meta-cluster. Based on the natural variations among healthy in¬ 
dividuals, a meta-cluster can be declared relatively homogeneous when the ratio of 
between- to within-cluster variance is less than one. 


In continuing work, I plan to investigate the use of networks instead of trees to 
organize the templates, similar in spirit to the use of networks rather than trees in 
phylogenetics [160 . Another issue is that the combinatorial dissimilarity measure 
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between two samples is not a metric, and when the dissimilarity is extended to two 
templates, this value does not monotonically increase in the hierarchical matching and 
merging algorithm. I plan to investigate other dissimilarity measures in this regard. 
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6 CLASSIFYING FC SAMPLES BASED ON TEMPLATES 


6.1 Introduction 


Consider a collection of flow cytometry samples, each sample consisting of fluores¬ 
cence measurements of protein markers made at the single-cell level from hundreds 
of thousands of cells, indicating different cell types present in each sample. I describe 
an algorithm to dynamically classify such samples into several classes based on phe- 
notypically distinct templates. As described in Chapter the class templates can 
be constructed by hrst organizing the samples into a template-tree data structure, 
and then identifying the templates by the roots of the well separated branches. In 
this way, templates work as prototypes of different biological classes (e.g., disease 
status, time points, etc.) and emphasize the common properties of the class while 
omitting sample-specihc details. In this chapter, I discuss classifying new samples 
by comparing them with templates and dynamically updating the templates (and 
the template-tree) as new samples are classihed. The template-base classihcation is 
robust and efficient because it compares samples to cleaner and fewer class templates 
rather than the large number of noisy samples themselves. 

The concepts of cluster, meta-cluster, sample, and template are explained in Chap¬ 
ter 1^ Briefly, a template is a collection of relatively homogeneous meta-clusters 
commonly shared across samples of a given class, thus describing the key immune- 


phenotypes of an overall class of samples in a formal, yet robust, manner 23,27,36 


Here, a meta-clusters or meta-population is a generic (abstract) population formed 
by combining cell populations expressing similar phenotypes in different samples. 
Clusters participating in a meta-cluster usually represent the same type of cells and 
thus have overlapping distributions in the marker space. Therefore, a template is 
characterized by a mixture of (normally distributed) meta-clusters. 
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Besides their use in high-level visualization and between-class comparisons, tem¬ 
plates can be employed to classify new samples with unknown status. In this chapter, 
I use this approach to classify samples in terms of their expression of markers of the 
immune system. In the static classihcation approach, I build a hxed number of tem¬ 
plates, each representing samples from a particular class, and organize them into a 
template-tree data structure. A new sample is then predicted to come from a class 
whose template it is most similar to. In the dynamic classihcation approach, I update 
the templates and also the template-tree, as new samples are classihed. 

I demonstrate the use of template-based classihcation with two diherent datasets. 
The hrst dataset measures the diherences in phosphorylation events before and after 
stimulating T cells in human whole blood with an anti-CD3 antibody. By creating 
pre-stimulation and post-stimulation templates, we can classify samples according 
to their stimulation status. The second dataset studies the natural variations among 
diherent subsets of immune cells in hve healthy individuals. Blood was collected on up 
to four diherent days from each subject and hve technical replicates were created from 
each subject. Five templates could succinctly represent samples from hve individuals 
despite the within-subject temporal and technical variations, demonstrating the fact 
that technical and day-to-day variations are smaller than the natural variation across 
individuals in this dataset. 

Template-based classihcation has been used in several areas such as face recog¬ 
nition, speech recognition, character recognition, etc. In face recognition 161 , a 


template library is created with one or more digital images from each person. An 
unclassihed image is compared to each database image by computing correlations of 
diherent features (eyes, nose, mouth etc.) and is classihed as the one giving the high¬ 


est cumulative score. In speech recognition 162,163 , a template is created for each 
speaker by a sequence of consecutive acoustic feature vectors and an incoming signal is 
classihed by comparing it with the templates using the Dynamic Time Warping algo¬ 


rithm. In character recognition 164 , representative prototypes for each character are 
created from diherent writing styles and an incoming character is classihed by com- 
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paring it to existing prototypes using a feature matching algorithm. The algorithms 
discussed in this chapter have similarities to these methods in principle but differs 
from them signihcantly in how the templates are created, represented, and compared 
with incoming samples. In contrast to these approaches, I maintain a hierarchy of the 
training samples in order to use their relationships in future classihcation. A tem¬ 
plate is then represented with the shared features of all samples in a class whereas 
the methods discussed above use representatives from the training set. Furthermore, 
the dynamic template algorithm continuously updates templates as new samples are 
classihed, which improves the accuracy of future classihcation. 


The rest of this chapter is organized as follows. In Section |6.2| , I describe the 
classihcation methods based on static templates created from a collection of training 


samples. Section 6.3 outlines a robust dynamic classihcation algoritm. The next 


Section 6.4 demonstrates the application of template-based classihcation with two 


representative datasets. I conclude this chapter in Section 6.5 


6.2 Classifying Samples with Static Templates 


6.2.1 Creating static templates from a collection of FC samples 


Consider a collection of N FC samples belonging to M disjoint classes. In Sec¬ 


tion 


5.2, I described a hierarchical matching-and-merging (HM&M) algorithm that 


organizes N samples into a binary template-tree data structure. A node in the tree 
represents either a sample (leaf node) or a template (internal node). In both cases 
a node is characterized by a hnite mixture of multivariate normal distributions each 
component of which is a cluster or meta-cluster. The height of an internal node in 
the template-tree is measured by the dissimilarity between its left and right children. 
By recursion, a template denoted by a relatively lower internal node represents a 
relatively homogeneous collection of samples and vice versa. 

After building a template-tree, we can cut the tree at a suitable height so that 
M disjoint subtrees are produced. The root of each subtree represents a template of 
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the samples placed in the leaves of that subtree. The class (label) of a template is 
determined by the label of the majority of the samples in the subtree rooted at the 
template. However, if the number of classes M is not known a priori, M is set to the 
number of well-separated branches based on the relative heights of the subtrees. The 
roots of these well-separated subtrees represent the class templates, where within-class 
variations (heights of the subtrees) are small relative to the between-class variations 
(heights of the ancestors of the subtrees). 

6.2.2 Classifying new samples with static templates 

Let Ti, T 2 , ..., Tm be M templates created from a collection of N samples, where 
the template Ti summarizes samples of the class. The dissimilarity D{S,Ti) 
between a sample S and a template Ti is computed by the mixed edge cover (MEC) 
algorithm described in Chapter (Note that both a sample and a template are 
characterized by hnite mixtures of multivariate normal distributions each component 
of which is a cluster or meta-cluster.) The new sample S is predicted to belong to 
the class whose template it is most similar (least dissimilar) to: 

i* = arg^min^ Zi)(S', Tj), class(S') = class(Tj*). (6.1) 

During classihcation, if the sample’s dissimilarity with the closest template is above 
a threshold (i.e., it is not similar to any of the class templates), then it represents a 
new class not represented by a template. Hence we need to create a new template for 
this sample. I address this issue in the next section. 

The template-based classihcation is fast because we need to compare a new sample 
only with M templates instead of with N samples. Let K be the maximum number of 
clusters or meta-cluster present in a samples or template. Then the time complexity 
of a classihcation is 0{MK^\ogK), where 0{K^\ogK) time is required to compute 
dissimilarity (mixed edge cover) between samples. Typically the number of distinct 
classes (M) is much smaller than the number of samples N. Hence, computing 0{M) 
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dissimilarity is faster than classifying the sample from scratch, which reqnires 0{N) 
dissimilarity compntations. 


6.3 Bnilding dynamic templates 
6.3.1 The algorithm 


The classihcation method based on static templates has two limitations. First, 
the algorithm reqnires a signihcant nnmber of samples from different biological classes 
to constrnct meaningfnl class templates. Hence, this approach is inadeqnate when 
samples arrive seqnentially or in batches, as for instance in a longitndinal stndy of 
an epidemic. Second and more important, the algorithm bnilds a hxed set of static 
templates and does not npdate templates as new samples are classihed. Therefore, 
fntnre classihcation can not nse the information gained from samples classihed with 
the static template-tree. 

I address both of these limitations with dynamic npdates of templates as new 
samples are classihed. The npdate operation is performed by inserting the new sample 
into the cnrrent template-tree and npdating necessary internal nodes (templates) in 
the tree. Snbseqnent classihcation tasks are performed on the regnlarly npdated 
templates. This approach also works when we do not have any training dataset to 
begin with. In that case, the template-tree is created from the scratch by repeated 
insertion of samples in a dynamic fashion starting with empty templates. Consider 
an existing template-tree TT (possibly empty) with r as the root node. Note that r 
can be considered as the template of all samples in the leaves of the tree. In order to 
insert a new sample S in TT, I hrst create a singleton node v from S. If TT is an empty 
tree I make v the root of the template-tree, and otherwise, I insert v into the tree TT 


by invoking the procednre insert shown in Fig. 6.1 with r and v as the parameters. 


The procednre insert works in a recursive fashion. It follows a path from the 
root to a node (a leaf or internal node), to be identihed by the algorithm, where the 
new node v is inserted. The procedure then backtracks by updating the mixtures of 
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> Insert leaf node v in the subtree rooted at u 
> Inserting in an empty tree 


> Case 1 


1: procedure insert(M,n) 

2: if M is a empty then 

3: return v 

4: end if 

5: if M is a leaf then 

6: w ^ empty node, Wi ^ u, tCj, t— f, x ^ w 

7: else 

8: D -ir- mm{D{ui, Ur), D{ui, v), D{ur, n)} 

9: if D{ui,Ur) = D then 

10: w •(— empty node, wi u, Wr v, x ■(— ta 

11: else if D{ui,v) = D then 

12: M; insert( m/, n), x^u 

13: else > when D{ur,v) = D, Case 4 

14: Mj, <(—insert( m^, n), x^u 

15: end if 

16: end if 

17: update node x by matching and merging meta-clusters from xi and Xr 

18: height(x) = D{xi,Xr) 

19: return x t> Going up in the tree 

20: end procedure 


> Case 2 


> Case 3 


Figure 6.1. Inserting a leaf node (sample) u in a subtree rooted at u 
(template or sample) 


the internal nodes found in the return path back to the root. I consider four cases 
while inserting u in a subtree rooted at u. The cases are illustrated in Fig. |6.2 In the 
hrst case u is a leaf node, and I create a new node w and make u and v the children of 
w. I create a template from the samples in the leaves u and v and save it in node w. 
In the other cases u is an internal node. Let ui and Ur be the left and right children 
of M, respectively. I compute dissimilarities D{ui,Ur), D{ui,v) and D{ur,v) between 
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Figure 6.2. Four cases to consider when inserting a leaf node v into the 
snbtree rooted at u. (a) case 1 {u is also a leaf): a new internal node w is 
created and is made the parent of u and u, (b) case 2 (u is a non-leaf and 
the left and right children of u are more similar to each other than to u): 
a new internal node w is created and is made the parent of u and u, (c) 
case 3 (u is a non-leaf and the left child ui of u is more similar to v than 
to the right child Ur)'. insert v into the snbtree rooted at ui by calling 
insert(Mj, u), and (d) case 4 (u is a non-leaf and the right child Ur of u 
is more similar to v than to the left child ui): insert v into the snbtree 
rooted at Ur by calling insert(ur,u). The dotted parts in Snbhgure (c) 
and (d) are determined by the insert fnnction in a recursive fashion. 


each pair of nodes from ui,Ur, and v. If D{ui,Ur) is the smallest among the three 
dissimilarities, then v cannot be inserted in a snbtree rooted at u. Thus I create a 
new node w and make u and v the children of w. I create a new template from the 
template u and sample v, save it in node w and retnrn. When D{ui,v) is the smallest 
dissimilarity, I insert u in a snbtree rooted at ui by calling the procednre insert 
with ui,v as parameters. In this case the left snbtree of u gets npdated. Similarly, if 
D{ur,v) is the smallest then v is inserted in the right subtree rooted at Ur. 


6.3.2 Computational complexity 

To insert a new sample, we need to traverse a path starting from the root to a leaf 
or an internal node in a template-tree. In the worst case the length of the traversed 
path is the height of the template-tree. Let N be the nnmber of samples and h be 
the height of a template-tree where {N — 1) < h < log 2 (iV). The former eqnality 
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holds when the tree is completely unbalanced (a chain) whereas the latter equality is 
satisfied when the tree is balanced. At each node in the traversed path we need to 
compute three dissimilarities and one update operation (when backtracking). Let K 
be the maximum number of clusters or meta-clusters in a node of the template-tree 
and p be the dimension (number of features) of the data. A dissimilarity computation 
takes 0{K^ log K) time whereas an update operation takes 0{K^ log K)+0{Kp) time 
when meta-cluster parameters are computed by the maximum likelihood estimation. 
Hence, the time complexity of inserting a sample in a template-tree is 0{hK^ \ogK). 


6.3.3 Classifying a sample 


To classify a new sample S, it is inserted into the current template-tree using 


the procedure described in Fig. |6.1[ The class of S is predicted to be the class of 
the template created from the subtree where S is inserted. At the time of insertion 
the template-tree is dynamically updated to reflect the information gained from the 
new sample. The dynamic template approach is especially useful in unsupervised 
classification where the class labels of the samples are not known in advance. In 
that case, the class templates are created from the well-separated subtrees such that 
within-class variations (heights of the subtrees) are small relative to the between-class 
variations (heights of the ancestors of the subtrees). In this context, the algorithm 
is similar to the spirit of the hierarchical clustering algorithm UPGMA, with signifi¬ 
cant differences in the distance computation and management of the internal nodes. 
Furthermore, when a new sample is highly dissimilar to every existing template, the 
algorithm automatically creates a new branch in the tree indicating a new class. This 
approach therefore has the ability to discover unknown classes from the incoming 
samples, which, for example, is very useful in detecting new strains of a disease. 


The algorithm inserts samples into the template-tree in a particular order, which 
can determine the relative positions of nodes in the template-tree. How sensitive 
is the template-based classification to the order in which the samples are inserted 
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into the template-tree? If the between-class variation is signihcantly higher than the 
within-class variation (as is the case in the two datasets studied in this chapter), the 
classihcation accuracy is unaffected by the small differences in the subtrees of the 
template-tree. In current work, we are studying different rearrangement techniques 
for creating robust template-tree insensitive to the order of sample insertions. 


6.4 Results 

6.4.1 Classifying stimulation status of T cells 


I apply the template-based classihcation algorithm to classify the stimulation sta¬ 
tus of 29 pairs of samples collected before and after stimulating blood with anti-CD3 
antibody. Each of the 29 pairs of samples in the T cell phosphorylation (TCP) dataset 
measures the abundance of four protein markers (CD4, CD45RA, SLP-76, and ZAP- 
70) 1^. During the stimulation, anti-CD3 antibody binds with T cell receptors 
(TCR) and phosphorylate few domains of naive and memory T cells. The effect of 
phosphorylation is revealed by the higher expressions of SLP-76 and ZAP-70 proteins 
in T cells after the stimulation. Section [1.6.21 describes this dataset in detail. 

For a pair of samples from the subject, I denote the unstimulated sample by 
i— and the stimulated sample by i+. Each of the 58 samples in the TCP dataset 
was preprocessed and clustered independently to identify four cell populations: (a) 
CD4+CD45RA^°'^ T cells, (b) CD4+CD45RA^‘g^ T cells, (c) CD4-CD45RA^‘g^ T 


cells, and (d) CD4“CD45RA*°’" T cells (cf. Section 3.5.2). (Recall that ‘-f’ and ‘high’ 
indicate higher abundances of a marker, and ‘ —’ and ‘low’ indicate lower levels of it.) 

Classification with static templates-. I divide 29 pairs samples randomly into a 
training set and a test set. From the training samples, the HM&M algorithm con¬ 
structs a template-tree where the left and right children of the root represent the pre- 


and post-stimulation templates. Fig. 6.3 illustrates the classihcation process with a 


template tree created from six pairs of samples. After constructing the tree, it is cut 
beneath the root to create the pre-stimulation (Tbefore) and post-stimulation (Tafter) 
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Figure 6.3. Illustration of sample classification based on static templates. 
The HM&M algorithm creates two templates, Tbefore for before-stimulation 
and Tafter for after stimulation classes, from six pairs of samples in the TCP 
dataset. A new sample is compared with the templates, and is classihed 
with the template it is most similar to. 


templates. To predict the stimulation status of the sample Si in the test set, I com¬ 
pute the dissimilarity D between Si and each of the templates with the cost of the op¬ 
timum mixed edge cover algorithm. Si is predicted to come from the pre-stimulation 
class when Si is more similar to Tbefore than to Tafter (i-e., D{Si, Tbefore) < D{Si, Tafter)), 
and otherwise, from the post-stimulation class. In Fig. |6.3 I show a sample from the 
test set above the two templates. The algorithm correctly classihes it as a pre¬ 
stimulation sample, and the correctness of the classihcation can be verihed visually 
as this sample does not show a phosphorylation shift in SLP-76. (The 3-D scatter 
plots are shown for visualization only; classihcation is performed in 4-D.) 

I study the accuracy of the template-based classiher with a cross-validation. At 
each stage of the cross-validation, training and test sets are created from 10 and 48 
samples respectively. The status of each test sample is predicted by comparing it with 
the templates created from the training set. A sample is considered to be misclassihed 
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Figure 6.4. A dynamic template-tree created incrementally by adding the 
samples in the TCP dataset one after another. Minus and plus signs are 
appended to the subject number to indicate pre- and post-stimulation 
samples. Pre-stimulation samples are in blue, and post-stimulation sam¬ 
ples are red. The height of a node measures the dissimilarity between its 
left and right children, whereas the horizontal placement of a sample is 
arbitrary. 


when its predicted class is different from the actual class. This process is repeated 
58 times for different combinations of training and test sets. I observed that three 
pre-stimulation samples, 9-, 10- and, 11-, were consistently classified with the after¬ 
stimulation class whenever they were present in the test set. No other sample is 
classified into a wrong class in the cross-validation. I consider these three samples 
as outliers, show that they are likely to have been stimulated before the experiment, 
and discuss their properties further in the following section. 

Classification with dynamic templates: In order to demonstrate the classification 
approach based on dynamic templates, I build a template-tree incrementally from the 
samples in the TCP dataset. Starting with an empty tree, the algorithm described 
in Sec. m inserts the samples one after another into the current tree. Fig. |6.4| shows 
the complete template-tree where pre-stimulation samples are shown in blue and 
post-stimulation samples are shown in red. Aside from three outlying samples, all 
samples create two well-separated branches of the root denoting the pre- and post- 
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Figure 6.5. Heat plot showing the dissimilarity between pairs of samples 
in the TCP dataset where the color of a square corresponds to the dissim¬ 
ilarity of a pair of samples. A square is drawn in a light shade when the 
pair of samples is similar, and in a dark shade when the pair of samples 
is highly dissimilar. The plot is symmetric about the main diagonal. The 


samples are listed as they are ordered horizontally in Fig. 6.4 


stimulation templates. The height of an internal node in a template-tree is measured 
by the dissimilarity between the pair of samples (templates) denoted by the left and 


right children of the internal node. The height of the root in Fig. 6.4 is more than 
twice of the height of any other node. Hence the algorithm successfully identihes two 
templates with small within-template deviation while maintaining a clear separation 
between them. 

When inserting a new sample S, the template-tree is dynamically updated to 
reflect the information gained from the new sample. After insertion, the position 
of S in the tree determines its predicted class. S is classihed as a pre-stimulation 
sample when it is placed in the left (blue) subtree and otherwise, classihed as a post¬ 
stimulation sample. Similar to the classihcation with static templates, we observe in 
Fig. |6.4 that all samples except 9-, 10- and 11-, are correctly classihed. 
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Figure 6.6. Levels of SLP-76 expression in each pair of samples (joined by 
a line) from the TCP dataset. For most samples, SLP-76 levels increase 
after the anti-CD3 stimulation. However, the three samples {9,10,11} 
have high levels of SLP-76 protein in their pre-stimulation states, and do 
not show the usual increase after stimulation. 


Outlying samples: Now I discuss the three outlying pre-stimulation samples, 9- 
, 10- and 11-, which are consistently classihed with the post-stimulation samples by 


both the static and dynamic classihcation algorithms. Fig. |6.5| shows the dissimilarity 
between every pair of samples in the dataset in a heatplot where a square is drawn 
in a lighter shade when a pair of samples is similar, and in darker shade when a pair 
of samples is highly dissimilar. We observe that most squares in the top-left and 
bottom-right quadrants are in light colors indicating similarity among samples within 
pre- and post-stimulation classes. However, three pre-stimulation samples, 9-, 10- 
and 11-, are more similar to the post-stimulation samples than to the pre-stimulation 
samples. This anomaly can be explained by plotting the average expression of SLP-76 


protein for each pair of samples in Figure 6.6 The three outlying samples {9,10,11} 
have much higher levels of SLP-76 than the remaining samples in their pre-stimulation 
states. On stimulation, two of these samples have decreased levels of SLP-76, while 
one shows an increase. Consequently these samples are classihed with the post¬ 
stimulations samples by the templates-based classihcation algorithms presented here. 
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Figure 6.7. The template-tree created from samples in the healthy donor 
dataset. The algorithm identihes hve well separated branches denoting 
templates for the hve subjects. A subtree consisting of samples from the 
same subject is shown in same color. 


6.4.2 Classifying immune patterns in healthy individuals 


I further validated the template-based classihcation with a healthy donor (HD) 


dataset described in Section |1.6.1[ The HD dataset represents a “biological simu¬ 
lation” where peripheral blood mononuclear cells (PBMC) were collected from hve 
healthy subjects on diherent days, and each sample was divided into hve parts and 
analyzed through a how cytometer at Purdue’s Bindley Biosciences Center. Each of 
the 65 samples is preprocessed, variance stabilized, and clustered to identify four cell 
populations: (a) helper T cells (CD3’''CD4+), (b) cytotoxic T cells (CD3’''CD8’''), (c) 
B cells (CD3“CD19’''), and (d) natural killer cells (CD3“CD19“). Detail discussion 


about the clustering step can be found in Section 3.5.1 


A template tree is created from the 65 samples in the HD dataset by repeated inser¬ 


tion of samples with the procedure described in Sec. 6.3 Fig. 6.7 shows the complete 


template-tree where samples from hve individuals are shown in hve diherent colors. 
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We observe that samples from the hve subjects create hve well-separated branches and 
the roots of the hve branches denote subject-specihc templates. Intuitively we expect 
samples from each subject to be classihed together. Here, the within-subject varia¬ 
tions of a subject come from day-to-day natural variations and technical variations in 
how cytometry sample preparation and measurement, whereas the between-subject 
variations come from the innate biological variability in the healthy subjects. In this 
dataset we observe more natural variation than the temporal and technical variations. 

The observations from the healthy donor dataset conhrm that we can build im¬ 
mune prohles for individuals despite within-subject variations from diherent sources. 
Additionally, the hve templates from the hve subjects create another level of hierar¬ 


chy and the root of the tree in Fig. |6.7| can be considered as a template for healthy 
individuals. This combined template represents a healthy immune prohle by preserv¬ 
ing the common features of healthy individuals and by removing between subject 
variations. This template can be compared against templates created from diseased 
samples in order to diagnose diseases and to perform comparative study of healthy 
and diseased immune prohles. 


6.5 Conclusions 

I describe template-based classihcation methods for how cytometry samples dis¬ 
playing a combination of diherent immunophenotypes. A template built from samples 
of a class provides a concise description of the class by emphasizing the key character¬ 
istics while masking statistical noise and low-level details, and thus helps to measure 
overall changes in cell populations across diherent conditions. By moving beyond 
sample-specihc variations, the templates act as the blueprints for diherent classes, 
and can be used to classify future samples to diherent classes in a more relevant pa¬ 
rameter space. It is also more efficient to classify a sample using templates rather 
than all of the previously seen samples. I maintain a hierarchy of the samples in a 
template-tree such that samples can be analyzed in higher resolution as needed. 
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The major contribution in this chapter is a dynamic algorithm to construct and 
update the templates, and build and maintain the template-tree, when the samples 
arrive continuously over a period of time. As new samples come in, the templates are 
dynamically updated to reflect the information gained from them. This is a desirable 
property in dynamic situations, as in the course of an epidemic, when new samples 
are being collected and analyzed. Another context where the dynamic classihcation 
approach is useful is when the samples are collected at a large number of hospitals or 
labs; the data at each hospital can be analyzed in situ, and only the summaries need 
to be shared among the hospitals to create a global prohle of the immune system, 
thus avoiding issues with privacy of clinical data. 

Dynamic classihcation is a critical step towards characterizing diverse states of 
the human immune system from big datasets of samples collected at geograph¬ 
ically distributed laboratories, e.g., the Human Immunology Project Consortium 
(www.immuneprofiling.org). This work makes it possible to summarize the data 
from each laboratory using templates for each class, and then to merge the templates 
and template-trees across various laboratories, as the data is being continuously col¬ 
lected and analyzed. 
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7 CLASSIFICATION OF ACUTE MYELOID LEUKEMIA 


7.1 Introduction 


Can Acute Myeloid Leukemia (AML) samples be distinguished from healthy ones 
using flow cytometry data from blood or bone marrow with a template-based clas- 
sihcation method? This method builds a template for each class to summarize the 
samples belonging to the class, and uses them to classify new samples as described in 
Chapter This question is interesting because AML is a heterogeneous disease with 
several subtypes and hence it is not clear that a template can succinctly describe all 
types of AML. Furthermore, we wish to identify immunophenotypes (cell types in the 
bone marrow and blood) that are known to be characteristic of subtypes of AML. 
Pathologists use these immunophenotypes to visualize AML and its subtypes, and a 
computational procedure that can provide this information would be more helpful in 
clinical practice than a classihcation score that indicates if an individual is healthy 
or has AML. 

In Chapter I described a template-based classihcation framework for analyzing 


how cytometry (FC) data 46 . In this framework, each sample is characterized by 
means of the cell populations that it contains. Similar samples belonging to the same 
class are described by a template for the class. A template consists of meta-clusters 
that characterize the cell populations present in the samples that constitute the class. 
I described a hierarchical algorithm in Chapter that organizes the templates into a 
template tree. Given a sample to classify, we can compare it with the nodes in the 
template tree, and classify it to the template that it is closest to. A combinatorial 
measure for the dissimilarity of two samples or two templates, computed by means 
of a mixed edge cover in a graph model (described in Chapter |^, is at the heart of 
this approach. 
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The algorithmic pipeline for template-based classihcation has been applied to 
various problems: to distinguish the phosphorylation state of T cells; to study the 
biological, temporal, and technical variability of cell types in the blood of healthy 
individuals; to characterize changes in the immune cells of Multiple Sclerosis patients 


undergoing drug treatments; and to predict the vaccination status of HIV patients 37 


46 . However, it is not clear if the AML data set can be successfully analyzed with 


this scheme, since AML is a heterogeneous disease at the morphologic, cytogenetic 
and molecular levels, and a few templates may not describe all of its subtypes. 

AML is a disease of myeloid stem cells that differentiate to form several types 
of cells in the blood and marrow. It is characterized by the profusion of imma¬ 
ture myeloid cells, which are usually prevented from maturing due to the disease. 
The myeloid stem cell differentiates in several steps to form myeloblasts and other 
cell types in a hierarchical process. This hierarchical differentiation process could 
be blocked at different cell types, leading to the multiple subtypes of AML. Eight 
different subtypes of AML based on cell lineage are included in the French-American- 


British Cooperative Group (FAB) classihcation scheme 165 . (A different World 
Health Organization (WHO) classihcation scheme has also been published.) Since 
the prognosis and treatment varies greatly among the subtypes of AML, accurate 
diagnosis is critical. 

In this chapter, I extend the template-based classihcation scheme presented in 
Chapter [^by developing a scoring function that accounts for the subtleties of FC data 


of AML samples 45 . Only a small number of the myeloid cell populations in AML 
samples are specihc to AML, and there are a larger number of cell populations that 
these samples share with healthy samples. Furthermore, the scoring function needs 
to account for the diversity of the myeloid cell populations in the various subtypes of 
AML. This approach has the advantage of identifying immunophenotypes of clinical 
interest in AML from the templates. Earlier work on the AML dataset used in 
this chapter has classihed AML samples using methods such as nearest neighbor 
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classification, logistic regression, matrix relevance learning vector quantization, etc.. 


but they have not identihed these immunophenotypes; e.g., 166-168 


Template-based classihcation has the advantage of being more robust than simple 
nearest neighbor classihcation since a template summarizes the characteristic prop¬ 
erties of a class while ignoring small sample-to-sample variations. It is also scalable 
to large numbers of samples, since it compares a sample to be classihed only against 
a small number of templates rather than the much larger number of samples. The 
comparisons with the templates can be performed efficiently using the structure of 
the template tree. It also reduces the data size by clustering the data to identify cell 
populations and then working with the statistical distributions characterizing the cell 
populations, in contrast to some of the earlier approaches that work with data sets 
even larger than the FC data by creating multiple variables from a marker (reciprocal, 
powers, products and quotients of subsets of the markers, etc. 167| ). 

Template-based classihcation has been employed in other areas such as character, 
face, and image recognition, but its application to FC is relatively recent. In addition 


to our work, templates have been used for detecting the effects of phosphorylation 27 


evaluating the efficiency of data transformations 23 , and labeling clusters across 


samples 39 


7.2 Methods 

7.2.1 The AML Dataset 


I have used an FC dataset on AML that was included in the DREAM6/FlowCAP2 


challenge of 2011 169 . The dataset consists of FC measurements of peripheral blood 
or bone marrow aspirate collected from 43 AML positive patients and 316 healthy 
donors over a one year period. Each patient sample was subdivided into eight aliquots 
(“tubes”) and analyzed with different biomarker combinations, hve markers per tube 


(most markers are proteins) as described in Table 7.1 In addition to the markers, the 


forward scatter (FS) and side scatter (SS) of each sample was also measured in each 
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Table 7.1 

The fluorophore-conjugated antibodies contained in each of the 8 tubes in 
which the samples were incubated. 


Tubes 

FLl 

(FITC) 

FL2 

(PE) 

FL3 

(EGD) 

FL4 

(PG5) 

FL5 

(PG7) 

1 

IgGl 

IgGl 

GD45 

IgGl 

IgGl 

2 

Kappa 

Lambda 

GD45 

GD19 

GD20 

3 

CD7 

GD4 

GD45 

GD8 

GD2 

4 

CD15 

GD13 

GD45 

GD16 

GD56 

5 

CD14 

GDllc 

GD45 

GD64 

GD33 

6 

HLA-DR 

GD117 

GD45 

GD34 

GD38 

7 

CD5 

GD19 

GD45 

GD3 

GDIO 

8 

NA 

NA 

NA 

NA 

NA 


tube. Hence, the dataset contains 359 x 8 = 2, 872 samples and each sample is seven¬ 
dimensional (hve markers and the two scatters). Tube 1 is an isotype control used to 
detect non-specihc antibody binding and Tube 8 is an unstained control for identifying 
background or autofluorescence of the system. Since the data has been compensated 
for autofluorescence and spectral overlap by experts, I omit these tubes from the 
analysis. The disease status (AML/healthy) of 23 AML patients and 156 healthy 
donors are provided as training set, and the challenge is to determine the disease status 
of the rest of the samples, 20 AML and 157 healthy, based only on the information in 
the training set. The complete dataset is available at http: //f lowrepository. org/. 

The side scatter (SS) and all of the fluorescence channels are transformed loga¬ 
rithmically, but the forward scatter (FS) is linearly transformed to the interval [0,1] 
so that all channels have values in the same range. This removes any bias towards 
FS channel in the multi-dimensional clustering phase. After preprocessing, an FC 
sample is stored as an n x p matrix A, where the element A{i,j) quantihes the 
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feature in the cell, and p is the number of features measured in each of n cells. In 
this dataset, p = 7 for each tube and n varies among the samples. 


7.2.2 Identifying cell populations in each sample 


I employ a two-stage clustering approach for identifying phenotypically similar 
cell populations (homogeneous clusters of cells) in each sample. At hrst, I apply the 
fc-means clustering algorithm for a wide range of values for fc, and select the optimum 
number of clusters k* by simultaneously optimizing the Calinski-Harabasz and S_Dbw 


cluster validation methods 34 . Next, the clusters identihed by the fc-means algorithm 


are modeled with a hnite mixture model of multivariate normal distributions. In the 
mixture model, the cluster is represented by two distribution parameters the 
p-dimensional mean vector, and S, the p x p covariance matrix. The distribution 
parameters for each cluster are then estimated using the Expectation-Maximization 
(EM) algorithm. The statistical parameters of a cluster are used to describe the 
corresponding cell population in the rest of the analysis. The population identihcation 
process is described in Chapter 


7.2.3 Creating templates from a collection of samples 

The hierarchical matching-and-merging (HM&M) algorithm described in Chap¬ 
ter is used to create templates representing distinct classes of samples. The HM&M 
algorithm arranges a set of similar samples into a binary template tree data struc¬ 
ture j^. The dissimilarity between a pair of samples is computed by the cost of 
optimum mixed edge cover solution discussed in Chapter 

7.2.4 Classihcation score of a sample in AML dataset 

Consider a sample X consisting of k cell populations S = {ci, C 2 ,..., Cfc}, with 
the cluster Cj containing |cj| cells. Let T~ and T~^ be the templates created from 



121 


AML-negative (healthy) and AML-positive training samples, respectively. I describe 
how to compute a score f{X) in order to classify the sample X to either the healthy 
class or the AML class. 

The intuition behind the score is as follows. An AML sample contains two kinds 
of cell populations: (1) AML-specific myeloblasts and myeloid cells, and (2) AML- 
unrelated cell populations, such as lymphocytes. The former cell populations corre¬ 
spond to the immunophenotypes of AML-specific metaclusters in the AML template, 
and hence when we compute a mixed edge cover between the AML template and an 
AML sample, these clusters get matched to each other. (Such clusters in the sample 
do not match to any metacluster in the healthy template.) Hence a positive score 
is assigned to a cluster in sample when it satishes this condition, signifying that it 
is indicative of AML. AML-unrelated cell populations in a sample could match to 
meta-clusters in the healthy template, and also to AML-unrelated meta-clusters in 
the AML template. When either of these conditions is satisfied, a cluster gets a neg¬ 
ative score, signifying that it is not indicative of AML. Since AML affects only the 
myeloid cell line and its progenitors, it affects only a small number of AML-specific cell 
populations in an AML sample. Furthermore, different subtypes of AML affect differ¬ 
ent cell types in the myeloid cell line. Hence, there are many more clusters common 
to healthy samples than there are AML-specihc clusters common to AML samples. 


(This is illustrated later in Fig. 7.2 (c) and (d).) Thus the range of positive scores 
are made relatively higher than the range of negative scores. This scoring system is 
designed to reduce the possibility of a false negative (an undetected AML-positive 
patient), since this is more serious in the diagnosis of AML. Additional data such 
as chromosomal translocations and images of bone marrow from microscopy could 
conhrm an initial diagnosis of AML from flow cytometry. 

In the light of the discussion above, we need to identify AML-specihc metaclusters 
initially. Given the templates T~^ and T~, we create a complete bipartite graph with 
the meta-clusters in each template as vertices, and with each edge weighted by the 
Mahalanobis distance between its endpoints. A minimum cost mixed edge cover 
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solution in this graph will match meta-clusters common to both templates, and such 
meta-clusters represent non-myeloid cell populations that are not AML-specihc. On 
the other hand, meta-clusters in the AML template that are not matched to a 
meta-cluster in the healthy template T~ correspond to AML-specihc metaclusters. 
Such meta-clusters in the AML template T+ are denoted by the set . 

Now we can proceed to compare a sample against the template for healthy samples 
and the template for AML. I compute a minimum cost mixed edge cover between 
a sample X and the healthy template T“, and let mec“(ci) denote the set of meta¬ 
clusters in T~ mapped to a cluster Cj in the sample X. Similarly, compute a minimum 
cost mixed edge cover between X and the AML template T"*", and let mec’^(ci) denote 
the set of meta-clusters in mapped to a cluster Cj. These sets could be empty if Cj 
is unmatched in the mixed edge cover. I compute the average Mahalanobis distance 
between q and the meta-clusters matched to it in the template T“, and dehne this as 


the dissimilarity d{ci, mec (cj)). From the formulation of the mixed edge cover in 37 


we have (i(cj, mec“(cj)) < 2A, where A is the cost of leaving a cluster unmatched 
in the mixed edge cover solution. Hence we can dehne the similarity between c* 
and mec“(ci) as s(ci,mec“(ci)) = 2A — (i(ci,mec“(ci)). By analogous reasoning, the 
similarity between c* and mec+(ci) is dehned as s(cj, mec+(cj)) = 2A — d{ci, mec+(cj)). 

The score of a sample is the sum of the scores of its clusters. I dehne the score 
of a cluster Cj, /(q), as the sum of two functions /^(q) and /“(q) multiplied with 
suitable weights. A positive score indicates that the sample belongs to AML, and a 
negative score indicates that it is healthy. 

The function /’’"(cj) contributes a positive score to the sum if Cj is matched to an 
AML-specihc meta-cluster in the mixed edge cover between the sample X and the 
AML template T"*", and a non-positive score otherwise. For the latter case, there are 
two subcases: If Cj is unmatched in the mixed edge cover, it corresponds to none of 
the meta-clusters in the template T+, and gets a zero score. If q is matched only to 
non-AML specihc meta-clusters in the AML template T"*", then it is assigned a small 
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negative score to indicate that it likely belongs to the healthy class (recall that k is 
the number of clusters in sample X). Hence 




s (Q,mec+(ci)), 

< [s(ci,mec+(ci))] , 
0 , 


if mec+(ci) fl M+ 7 ^ 0 , 

if mec+(cj) fl M+ = 0 , and mec+(cj) 7 ^ 0 , 

if mec’''(ci) = 0 . 


The function /“(cj) contributes a negative score to a cluster Cj in the sample X 
if it is matched with some meta-cluster in the healthy template T~, indicating that 
it likely belongs to the healthy class. If it is not matched to any meta-cluster in T~, 
then it is assigned a positive score A. This latter subcase accounts for AML-specihc 
clusters in the sample, or a cluster that is in neither template. In this last case, we 
acknowledge the diversity of cell populations in AML samples. Hence we have 

{ -| [s(ci,mec“(Q))] , if mec“(ci) 7 ^ 0 , 

A, if mec“(ci) = 0. 


Finally, we dehne 

CiGX I I 

Here |X| is the number of cells in the sample X. The score of a cluster q is weighted 
by the fractional abundance of cells in it. 


7.3 Results 

7.3.1 Cell populations in healthy and AML samples 


In each tube, I identify cell populations in the samples using the clustering al¬ 
gorithm described in Section |7.2.2 Each sample contains hve major cell types that 
can be seen when cell clusters are projected on the side scatter (SS) and CD45 chan¬ 


nels, as depicted in Fig. 7.1 (Blast cells are immature progenitors of myeloid cells 
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Figure 7.1. Cell types identified on the side scatter (SS) and CD45 chan¬ 
nels for a healthy and an AML positive sample. Cell populations are dis¬ 
covered in the seven-dimensional samples with the clustering algorithm 
and then projected on these channels for visualization. A pair of clusters 
denoting the same cell type is marked with the same color. The proportion 
of myeloid blast cells (shown in red) increases signihcantly in the AML 
sample. 


or lymphocytes.) The side scatter measures the granularity of cells, whereas CD45 
is variably expressed by different white blood cells (leukocytes). AML is initially di¬ 
agnosed by rapid growth of immature myeloid blast cells with medium SS and CD45 


expressions 141 marked in red in Fig. 7.1 According to the WHO guidelines, AML 
is initially conhrmed when the sample contains more than 20% blasts. This is the 
case for all, except one of the AML samples in the DREAM6/FlowCAP2 training set, 
and the latter will be discussed later. 


7.3.2 Healthy and AML templates 


From each tube of the AML dataset, using the training samples, we build two 


templates: one for healthy samples, and one for AML. As described in Section 7.2.3 


the HM&M algorithm organizes samples of the same class into a binary template 
tree whose root represents the class template. The template trees created from the 
healthy and AML training samples in Tube 6 are shown in Subhgures |7.2K a) and 
|7.2[b) respectively. The height of an internal node in the template tree measures 
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Meta-clusters in the healthy template 


Meta-clusters in the AML template 


Figure 7.2. The healthy and AML templates created from Tube 6. (a) 
The template-tree created from 156 healthy samples in the training set. 
(b) The template-tree created from 23 AML samples in the training set. 
Samples in the red subtree exhibit the characteristics of Acute Promye- 
locytic Leukemia (APL) as shown in Subhgure (f). (c) Fraction of 156 
healthy samples present in each of the 22 meta-clusters in the healthy 
template. Nine meta-clusters, each of them shared by at least 60% of the 
healthy samples, form the core of the healthy template, (d) Fraction of 
23 AML samples present in each of the 40 meta-clusters in the AML tem¬ 
plate. The AML samples, unlike the healthy ones, are heterogeneously 
distributed over the meta-clusters, (e) The expression levels of markers 
in the meta-cluster shown with blue bar in Subhgure (d). (Each hori¬ 
zontal bar in Subhgures (e) and (f) represents the average expression of 
a marker and the error bar shows its standard deviation.) This meta¬ 
cluster represents lymphocytes denoted by medium SS and high CD45 
expression and therefore does not express the AML-related markers mea¬ 
sured in Tube 6 . (f) Expression of markers in a meta-cluster shown with 
red bar in Subhgure (d). This meta-cluster denotes myeloblast cells as 
dehned by the SS and CD45 levels. This meta-cluster expresses HLA- 
DR“CD117’''CD34“CD38+, a characteristic immunophenotype of APL. 
Five AML samples sharing this meta-cluster are similar to each other as 
shown in the red subtree in Subhgure (b). 
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the dissimilarity between its left and right children, whereas the horizontal placement 
of a sample is arbitrary. In these trees, we observe twice as much heterogeneity in 
the AML samples than among the healthy samples (in the dissimilarity measure), 
despite the number of healthy samples being hve times as numerous as the AML 
samples. The larger heterogeneity among AML samples is observed in other tubes as 
well. The template-tree for AML partitions these samples into different subtrees that 
possibly denote different subtypes of AML. For example, the subtree in Fig. |7.2[ b) 
that is colored red includes samples (with subject ids 37, 58, 67, 89, and 117) with 
immunophenotypes of Acute Promyelocytic Leukemia (APL) (discussed later in this 
section). 

Together, the meta-clusters in a healthy template represent a healthy immune 
prohle in the feature space of a tube from which the template is created. 22 meta¬ 
clusters are obtained in the healthy template created from Tube 6. The percentage 
of samples from the training set participating in each of these meta-clusters is shown 
in Fig. 7.2[ c). Observe that 60% or more of the healthy samples participate in the 
nine most common meta-clusters (these constitute the core of the healthy template). 
The remaining thirteen meta-clusters include populations from a small fraction of 
samples. These populations could correspond to biological variability in the healthy 
samples,variations in the FC experimental protocols, and possibly also from the split¬ 
ting of populations that could be an artifact of the clustering algorithm. 

The AML template created from Tube 6 consists of forty meta-clusters (almost 
twice the number in the more numerous healthy samples). Fig. |7.2[ d) shows that, 
unlike the healthy samples, the AML samples are heterogeneous with respect to the 
meta-clusters they participate in: There are 21 meta-clusters that include cell popu¬ 
lations from at least 20% of the AML samples. Some of the meta-clusters common 
to a large number of AML samples represent non-AML specihc cell populations. For 
example. Fig. 7.2[e) shows the average marker expressions of the meta-cluster shown 


in the blue bar in Fig. 7.2(d). This meta-cluster has low to medium side scatter 
and high CD45 expression, and therefore represents lymphocytes (Fig. |7.1 ). Since 
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lymphocytes are not affected by AML, this meta-cluster does not express any AML- 
related markers, and hence can be described as HLA-DR“CD117“CD34“CD38“, as 
expected. Fig. |7.2K f) shows the expression profile of another meta-cluster shown in 
the red bar in Fig. 7.2[ d). This meta-cluster consists of five cell populations from hve 
AML samples (with subject ids 37, 58, 67, 89, and 117) and exhibits medium side 
scatter and CD45 expression and therefore, represents myeloid blast cells. Further¬ 
more, this meta-cluster is HLA-DR“CD117’''CD34“CD38^, and represents a profile 
known to be that of Acute Promyelocytic Leukemia (APL) |170| . APL is subtype 
M3 in the FAB classification of AML |165|) and is characterized by chromosomal 


translocation of retinoic acid receptor-alpha (RARa) gene on chromosome 17 with 
the promyelocytic leukemia gene (PML) on chromosome 15, a translocation denoted 
as t(15;17). In the feature space of Tube 6, these APL samples are similar to each 
other while significantly different from the other AML samples. The template-based 
classification algorithm groups these samples together in the subtree colored red in 
the AML template tree shown in Fig. |7.2[ b). 

7.3.3 Identifying meta-clusters symptomatic of AML 

In each tube, meta-clusters are registered across the AML and healthy templates 
using the mixed edge cover (MFC) algorithm. The unmatch-penalty A is set to \/7 
so that a pair of meta-clusters get matched only if the average squared deviation 
across all dimensions is less than one. Meta-clusters in the AML template that are 
not matched to any meta-clusters in the healthy template represent the abnormal, 
AML-specific immunophenotypes while the matched meta-clusters represent healthy 
or non-AML-relevant cell populations. Table [7^ lists several unmatched meta-clusters 
indicative of AML from different tubes. As expected, every unmatched meta-cluster 
displays medium side scatter and CD45 expression characteristic of myeloid blast cells. 


and therefore we can omit FS, SS, and CD45 values in Table 7.2 I briefly discuss 
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Table 7.2 

Some of the meta-clusters characteristic of AML for the 23 AML samples 
in the training set. In the second column, ‘ —‘low’, and ‘-I-’ denote very 
low, low and high, abundance of a marker, respectively, and ± denotes 
a marker that is positively expressed by some samples and negatively 
expressed by others. The number of samples participating in a meta¬ 
cluster is shown in the third column. The average fraction of cells in a 
sample participating in a meta-cluster, and the standard deviation, are 
shown in the fourth column. 


Tube 

Marker expression 

#Samples 

Fraction of cells 

2 

Kappa*°"Lambda*°"CD19+CD20“ 

5 

6.3%(±6.8) 

3 

CD7+CD4-CD8-CD2- 

4 

18.0%(±4.8) 

4 

CD15-CD13+CD16-CD56- 

17 

16.6%(±6.9) 

4 

CD15-CD13+CD16-CD56+ 

8 

11.1%(±5.7) 

5 

CD14-CDllc-CD64-CD33+ 

10 

13.5%(±5.2) 

5 

CD14-CDllc+CD64-CD33+ 

18 

10.8%(±3.8) 

5 

CD14'°"CDllc+CD64^°"‘CD33+ 

6 

13.8%(±4.3) 

6 

HLA-DR+CD117+CD34+CD38+ 

11 

13.3%(±2.6) 

6 

HLA-DR+CD117^CD34+CD38+ 

13 

17.3%(±6.6) 

6 

HLA-DR-CD117±CD34-CD38+ 

5 

12.9%(±4.7) 

7 

CD5-CD19+CD3-CD10- 

3 

12.3%(±2.4) 

7 

CD5+CD19-CD3-CD10- 

3 

10.0%(±8.5) 

7 

CD5-CD19-CD3-CD10+ 

1 

9.9% 


the immunophenotypes represented by each AML-specihc meta-cluster in each tube, 
omitting the iso type control Tube 1 and unstained Tube 8. 

Tube 6 is the most important panel for diagnosing AML since it includes several 
markers expressed by AML blasts. HLA-DR is an MHC class II cell surface receptor 
complex that is expressed on antigen-presenting cells, e.g., B cells, dendritic cells, 
macrophages, and activated T cells. It is expressed by myeloblasts in most subtypes 


of AML except M3 and M7 171 . CD117 is a tyrosine kinase receptor (c-KIT) 
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expressed in blasts of some cases (30 — 100%) of AML 171 . CD34 is a cell adhesion 
molecule expressed on different stem cells and on the blast cells of many cases of 
AML (40%) [m . CD38 is a glycoprotein found on the surface of blasts of several 


subtypes of AML but usually not expressed in the M3 subtypes of AML 173 . In Tube 
6 , the matching algorithm has identihed two AML-specihc meta-clusters with high 
expressions of HLA-DR and CD34. One of them also expresses CD117 and CD34, 
and Fig. 7.3[c) shows the bivariate contour plots of the cell populations contained 
in this meta-cluster. The second meta-cluster expresses positive but low levels of 
CD117 and CD34. These two HLA-DR+CD34+ meta-clusters together are present 
in 18 out of the 23 training AML samples. The remaining hve samples (subject id: 
5, 7, 103, 165, 174) express HLA-DR“CD117^CD34“CD38’'' myeloblasts, which is 


an immunophenotype of APL 170 as was discussed earlier. Fig. 7.3 d) shows the 
bivariate contour plots of this APL-specihc meta-cluster. 

Tube 5 contains several antigens typically expressed by AML blasts, of which 
CD33 is the most important. CD33 is a transmembrane receptor protein usually 
expressed on immature myeloid cells of the majority of cases of AML (91% reported 
in 174 ). The AML specihc meta-clusters identihed from markers in Tube 5 (see 


Table 7.2) include CD33^ myeloblasts from every sample in the training set. Several 
of the CD33’'' populations also express CDllc, a type I transmembrane protein found 
on monocytes, macrophages and neutrophils. CDllc is usually expressed by blast 
cells in acute myelomonocytic leukemia (M4 subclass of AML), and acute monocytic 
leukemia (M5 subclass of AML) [171| . Therefore CD14“CDllc’''CD64“CD33+ meta¬ 
cluster could represent patients with M4 and M5 subclasses of AML. I show the 
bivariate contour plots of this meta-cluster in Fig. |7.3[ b) . 

Tube 4 includes several markers usually expressed by AML blasts, of which CD13 
is the most important. CD13 is a zinc-metalloproteinase enzyme that binds to the cell 
membrane and degrades regulatory peptides [172 . CD 13 is expressed on the blast 
cells of the majority of cases of AML (95% as reported in 174| ). Table 7.2 shows 
two AML-specihc meta-clusters detected from the blast cells in Tube 4. Both of the 
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meta-clusters are GDIS’*' and they include populations from every AML sample in 
the training set. In addition to CD13, eight AML samples express CD56 glycoprotein 
that is naturally expressed on NK cells, a subset of CDd’*’ T cells and a subset of 


CD8^ T cells. Raspadori et al. 175 reported that CD56 was more often expressed 
by myeloblasts in FAB subclasses M2 and M5, which covers about 42% of AML cases 


in a study by Legrand et al. 174 . In this dataset, we observe more AML samples ex¬ 
pressing CD13''’CD56“ blasts than expressing CD13^CD56’'’ blasts, which conforms 


to the hndings of Raspadori et al. 175 . Fig. |7.3K a) shows the bivariate contour plots 
of the CD13''’CD56“ meta-cluster. CD15 is a carbohydrate adhesion molecule (not a 
protein) expressed on mature glycoproteins and CD16 is an Fc (Fragment, crystalliz- 
able) receptor protein expressed on mature NK cells, macrophages, and neutrophils. 
They are not usually expressed by myeloblasts and were absent in the AML-related 
meta-clusters. 

Tube 2 is a B cell panel measuring B cell markers CD 19 and CD20, and Kappa 
{k) and Lambda (A), immunoglobulin light chains present on the surface of antibodies 
produced by B lymphocytes. This panel is used in diagnosing diseases such as B-cell 
lymphomas and acute lymphoblastic leukemia (ALL) [171 . B-cell specihc markers 
are occasionally co-expressed with myeloid antigens especially in FAB M2 subtype of 
AML (with chromosomal translocation t(8;21)) |171[|176| . In Tube 2, the matching 
algorithm has identihed a meta-cluster in the myeloblasts that expresses high levels 
of CD19 and low levels of Kappa and Lambda. The hve samples with subject ids 5, 
7, 103, 165, and 174 participating in this meta-cluster possibly belong to the FAB-M2 
subtype of AML. 

Tube 3 is a T cell panel measuring T cell specihc markers CD4, CD8, CD2, and 
CD7. This panel is frequently used in diagnosing T-lineage ALL [171 ,172 . In signih- 
cant minority of cases of AML (e.g., 37% of cases reported in 174| ), CD7 is aberrantly 
expressed by myeloblasts 171| . The matching algorithm identihes a meta-cluster in 
the blast region, which includes four populations expressing CD7+CD4“CD8“CD2“. 
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(a) Tube 6 
HLA-DR+CD117+ 
CD34+CD38+ 
blasts 



CD45-ECD HLA-DR-FITC CD117-PE CD34-PC5 CD38-PC7 


(b) Tube 6 
HLA-DR'CD117± 
CD34'CD38+ 
blasts 


Figure 7.3. Bivariate contour plots (side scatter vs. individual marker) 
for two meta-clusters (one in each row) indicative of AML. The ellipses 
in a subplot denote the 95th quantile contour lines of cell populations in¬ 
cluded in the corresponding meta-cluster. Myeloblast cells have medium 
side scatter (SS) and CD45 expressions. The red lines indicate approxi¬ 
mate myeloblast boundaries (located on the left-most subhgures in each 
row and extended horizontally to the subhgures on the right) and conhrm 
that these meta-clusters represent immunophenotypes of myeloblast cells. 
Blue vertical lines denote the -|-/- boundaries of a marker. Gray subplots 
show contour plots of dominant markers dehning the meta-cluster in the 
same row. (a) HLA-DR+CD117’''CD34+CD38’'' meta-cluster shared by 
11 AML samples in Tube 6. (b) HLA-DR“CD117^CD34“CD38’'' meta¬ 
cluster shared by 5 AML samples in Tube 6. This meta-cluster is in¬ 
dicative of acute promyelocytic leukemia (APL). These bivariate plots are 
shown for illustration only, since the populations of specihc cell types are 
identihed from seven-dimensional data. 


This hnding conhrms that T cell antigens are infrequently expressed by AML blasts, 
and therefore, they are less useful in AML diagnosis and classihcation. 

Tube 7 is a lymphocyte panel with several markers expressed on T and B lym¬ 
phocytes and is less important in detecting AML since they are infrequently expressed 
by AML blasts. In Tube 7, we obtain three samples with CDIQ’*', three samples with 
CD5^ and a sample with CDIO’*' myeloblast cells. The meta-clusters in rows 11-13 


in Table 7.2 conhrm that lymphocyte antigens are infrequently expressed by AML 


blasts and are less useful in AML diagnosis and classihcation 171| 176 
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Table 7.3 

Precision (P), Recall (R), Sensitivity (S) and F-measure (F) of the 
template-based classification in the training set and test set of the AML 
data. The statistical measures are computed for each tube separately and 
two combinations of tubes. 


Tubes 

Training set 


Test set 


P 

R 

S 

F 

P 

R 

S 

F 

1 

1.00 

0.26 

1.00 

0.41 

1.00 

0.15 

1.00 

0.26 

2 

0.86 

0.26 

0.99 

0.40 

0.83 

0.25 

0.99 

0.38 

3 

1.00 

0.52 

1.00 

0.69 

1.00 

0.35 

1.00 

0.52 

4 

0.94 

0.74 

0.99 

0.83 

1.00 

0.75 

1.00 

0.86 

5 

0.75 

0.91 

0.96 

0.82 

0.65 

0.85 

0.94 

0.74 

6 

1.00 

0.70 

1.00 

0.82 

1.00 

0.80 

1.00 

0.89 

7 

0.52 

0.48 

0.94 

0.50 

0.48 

0.60 

0.92 

0.53 

All (2-7) 

1.00 

0.74 

1.00 

0.85 

1.00 

0.85 

1.00 

0.92 

4,5,6 

1.00 

0.96 

1.00 

0.98 

1.00 

1.00 

1.00 

1.00 


7.3.4 Impact of each tube in the classification 

As discussed in the methods section, I build six independent classifiers based 
on the healthy and AML templates created from Tubes 2-7 of the AML dataset. 
Tubes 1 and 8 are omitted since the former is an isotope control and the latter 
is unstained, and therefore, they are uninformative for classihcation. A sample is 
classihed as an AML sample if the classification score is positive, and as a healthy 
sample otherwise. Let true positives (TP) be the number of AML samples correctly 
classihed, true negatives (TN) be the number of healthy samples correctly classihed, 
false positives (FP) be the number of healthy samples incorrectly classihed as AML, 
and false negatives (FN) be the number of AML samples incorrectly classihed as 
healthy. Then, I evaluate the performance of each template-based classiher with the 
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Figure 7.4. Average classification score from Tubes 4,5,6 for each sample 
in the (a) training set and (b) test set. Samples with scores above the 
horizontal line are classihed as AML, and as healthy otherwise. The actual 
class of each sample is also shown. An AML sample (subject id 116) is 
always misclassihed in the training set, and this is discussed in the text. 


well-known four statistical measures: Precision, Recall(Sensitivity), Specihcity, and 
F-value, dehned as Precision = rp^pp , Recall(Sensitivity) = Specihcity = 

TN j t;, 1 2(PrecisionxRecall) rpi r j. i i • i 

pp^rpj,^ , and f -value = precision+Recaii ' -*■ measures take values m the interval 

[0,1], and the higher the values the better the classiher. 

First, I evaluate the impact of each tube in the classihcation of the training sam¬ 


ples. For a training sample X, the classihcation score is computed by comparing it 
with the healthy and AML templates created from the training set after removing 
X. The predicted status of X is then compared against true status to evaluate the 


classihcation accuracy. Table 7.3 (left panel) shows various statistical measures for 
the classihers dehned in Tubes 2-7 of the training set. The classihers based on Tubes 
4, 5, and 6 have the highest sensitivity because these tubes include several markers 


relevant to AML diagnosis 170,171 . The number of true negatives TN is high in ev¬ 
ery tube since the identihcation of healthy samples does not depend on the detection 
of AML-specihc markers. Hence specihcity is close to one for all tubes. Analogously, 
FP is low for most tubes, and we observe high precision for most tubes. The F-value 


is a harmonic mean of precision and recall, and denotes the superior classihcation 
ability of markers in Tubes 4-6. Averaging scores from all tubes does not improve 
the sensitivity and F-value dramatically. However, combining Tubes 4-6 gives almost 
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perfect classification with one misclassification for the training set. I plot the average 
classification scores from Tubes 4-6 for the training samples in Fig. |7.4[ a). On the x- 
axis, the healthy samples are placed hrst followed by the AML samples, and samples 
in each group are placed in the ascending order of the average classihcation score. 
The class labels of samples are also shown (blue circles for healthy and red triangles 
for AML samples). 

In Fig. g a), we observe an AML sample (subject id 116) with score below 


the classification boundary. Fig. |7.5| shows the cell populations present in subject 
116 projected on the SS and CD45 channels. In this subject, the proportion of 
myeloid blasts is 4.4%, which is lower than the minimum 20% AML blasts necessary 


to recognize a patient to be AML-positive according to the WHO guidelines 177 


(the FAB threshold is even higher, at 30%). (Recall Fig. 7.1 for a plot of the cell 


types in healthy and AML samples.) Hence this is either a rare case of AML, or one 
with minimal residual disease after therapy, or perhaps it was incorrectly labeled as 
AML in the training set. Since the template-based classiher weighs individual clusters 
with their cell proportions, the abnormal myelobasts contribute insignificantly in the 


hnal score computed by Eq. |7.1[ As a result. Subject 116 is placed with the healthy 
samples in Fig. |7.4 Subject 116 was classihed with the healthy samples by methods 


in other published work 166 


7.3.5 Classifying test samples 

Now we turn to the test samples. For each tube, I compute the classihcation 
score for each sample in the test set using templates created from the training set 


and applying Eq. '7_A Since the average classihcation score from Tubes 4-6 performs 
best for the training set, it is used as a classiher for the test set as well. Since 
the status of test samples was released after the DREAM6/FlowCAP2 challenge, we 
can determine the classihcation accuracy of the test samples. Fig. |7.4K b) shows the 
classihcation scores of the test samples, where samples are placed in ascending order 













135 


1/1 

i/i 


CO 

d 


lO 

d 


d 


CM 

d 


Subject 116 



myeloid blasts (4.4%) 


^-1- 

0.2 0.4 0.6 0.8 

CD45-ECD (log) 


Lymphocytes 
Lymphoid blasts 
Myeloid cells 
Myeloid blasts 
Monocytes 


Figure 7.5. Cell populations in a sample from subject 116. Even though 
this sample is marked as AML by the organizers of DREAM6/FlowCAP2 
competition, it contains only 4.4% myeloid blast cells (shown in red). 
According to the World Health Organization (WHO) guidelines, AML is 
only confirmed when peripheral blood contains at least 20% immature 
myeloblasts. Hence, this subject is either a rare AML subtype, or has 
been incorrectly labeled. 


of classification scores. In Fig. |7.4[ b), we observe perfect classification in the test 
set. Similar to the training set, I tabulate statistical measures for the classifiers in 
Table lAl 

When classifying a sample X, we can assume the null hypothesis: X is healthy 
(non-leukemic). The sample X receives a positive score if it contains AML-specific 
immunophenotypes, and the higher the score, the stronger the evidence against the 
null hypothesis. Since Tube 1 (isotype control) does not include any AML-specific 
markers, it can provide a background distribution for the classification scores. In 
Tube 1, 174 out of 179 training samples have negative classification scores, but five 
samples have positive scores, with values less than 0.2. In the best classifier designed 
from Tubes 4, 5, 6, we observe that two AML-positive samples in the training set 
and three AML-positive samples in the test set have scores between 0 and 0.2. The 
classifier is relatively less confident about these samples; nevertheless, the p-values of 
these five samples (computed from the distribution in Tube 1) are still small (< 0.05), 
so that they can be classified as AML-positive. The rest of the AML samples in the 
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training and test sets have scores greater than 0.2 and the classifier is qnite confident 
about their status (p-value zero). 

Four AML samples in the test set (ids 239, 262, 285, and 326) were subclassified 
as APL by comparing against distinct template trees for APL and the other AML 


samples in the training set (cf. Fig. 7.2 (b)). 

Finally, I state the computational times required on an iMac with four 2.7 GHz 
cores and 8 GB memory. The code is written in G++ and R. Gonsider a single tube 
with 359 samples in it. The fc-means clustering of all samples took one hour, primarily 
because we need to run the algorithm multiple times (about ten on the average) to 
find the optimal value of the number of clusters. Greating the healthy template from 
156 samples in the training set required 10 seconds (s) on one core, and the AML 
template for 23 AML samples took 0.5s on one core. Gross validation (leave one out) 
of the training set took 30 minutes, and computing the classification score for the 180 
test samples took 15s, both on four cores. I could have reduced the running time by 
executing the code in parallel on more cores. In an ongoing project, we have made 
the dominant step, the fc-means clustering of all the samples with an optimal number 
of clusters, faster using a GPU, reducing the total time to a few minutes. 


7.4 Gonclusions 

In this chapter, I have demonstrated that an algorithmic pipeline for template- 
based classification can successfully identify immunophenotypes of clinical interest in 
AML. These could be used to differentiate the subtypes of AML, which is advanta¬ 
geous since prognosis and treatment depends on the subtype. The templates enable 
us to classify AML samples in spite of their heterogeneity. This was accomplished by 
creating a scoring function that accounts for the subtleties in cell populations within 
AML samples. We are currently applying this approach to a larger AML data set, 
and intend to analyze other heterogeneous data sets. 
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8 CONCLUSIONS AND FUTURE WORK 


8.1 Conclusions 


In my dissertation, I have explored, designed and developed a collection of al¬ 
gorithms for analyzing multi-parametric flow cytometry data. These algorithms 
solve sample preprocessing, variance stabilization, clustering, population registra¬ 
tion, meta-clustering, sample classihcation, and other related problems arising in FC 
data analysis. I have assembled the algorithms into a data analysis pipeline, flow- 
Match 1^, that is made available as an open-source R package in Bioconductor 
(http://www.bioconductor.org/). The pipeline has been employed to analyze sev¬ 
eral practical datasets and the results are published in peer-reviewed journals and 


conferences 36 37l46 . 


Flow cytometry (FC) is a widely used platform for measuring phenotypes of 
individual cells from millions of cells in biological samples. Modern FC technol¬ 
ogy gives rise to high-dimensional and high-content data that challenges the ability 
to manually investigate the data and interpret the results. To address the limita¬ 
tions of manual analysis, researchers have developed a new branch of research called 
the “computational cytometry”. Along with a number of contemporary software 
tools 14, ^ 24,26,27,31, ^ 39,56,^, flowMatch automates different analysis steps 
of FC data with an aim to prevent the data analysis from being the bottleneck in 
scientihc discovery based on cytometry. 

The flowMatch pipeline consists of six well dehned algorithmic modules for (1) 
unmixing of spectra (compensation) to remove the effect of overlapping fluorescence 
channels, (2) transforming data in order to stabilize variance and normalize data, 
(3) identifying cell population by automated gating or clustering algorithm in the 
high-dimensional marker space, (4) registering cell populations (cluster labeling or 
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matching) across samples, (5) representing a class of samples with high-level tem¬ 
plates, and (6) classifying samples based on the phenotypic pattern of the cell clus¬ 
ters. Each module of flowMatch is designed to perform a specihc task independent 
of other modules of the pipeline. However, they can also be employed sequentially in 


the order described in Fig. 1.6 to perform the complete data analysis. 

I have developed a variance stabilization (VS) method called flowVS for removing 
the mean-variance correlation arisen in fluorescence based flow cytometry, flow VS 
stabilizes variance by transforming FC samples with the inverse hyperbolic sine func¬ 
tion whose parameters are estimated by minimizing Bartlett’s likelihood-ratio statis¬ 
tics. For population identihcation, I demonstrate that several simple, off-the-shelf 
clustering algorithms provide more accurate and robust partition of an FC sample. I 
describe several cluster validation methods that can be used to select the optimum 
parameters for a clustering algorithm (e.g., the optimum number of clusters), as well 
as choosing the best algorithm for a dataset. Furthermore, I describe an algorithm 
for computing the consensus of multiple clustering solutions, which performs better 
than any individual algorithm. 


For population registration 27,36 , I developed a robust mixed edge cover (MFC) 
algorithm that matches cell populations across a pair of FC samples. The MEC al¬ 
gorithm uses a robust graph-theoretic framework to match a cluster from a sample 
to zero or more clusters in another sample and thus solves the problem of missing or 
splitting populations. Next, I describe a hierarchical algorithm for encapsulating a 
collection of sample belonging to the same class with a template. A template is a col¬ 
lection of relatively homogeneous meta-clusters commonly shared across samples of a 
given class, thus describing the key immunophenotypes of an overall class of samples. 
Finally, I demonstrate that the use of templates leads to efficient classihcation algo¬ 
rithms. By comparing a sample with class templates, the sample is predicted to come 
from a class whose template it is most similar to. The template-base classihcation is 
robust and efficient because it compares samples to cleaner and fewer class templates 
rather than the large number of noisy samples themselves. 
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I have employed different components of flowMatch for classifying leukemia sam¬ 
ples, evaluating the phosphorylation effects on T cells, classifying healthy immune 
profiles, comparing the impact of two treatments against Multiple Sclerosis, and clas¬ 
sifying the HIV vaccination status. In these analyses, flowMatch is able to reach 
biologically meaningful conclusions with the automated algorithms. The algorithms 
included in flowMatch can also be applied to problems outside of flow cytometry such 
as from microarray data analysis and image recognitions. 

8.2 Future work 

Application to larger and higher-dimensional datasets: Thus far, I have em¬ 
ployed flowMatch to five FC datasets, the largest of which is an AML dataset 
with nearly 3000 seven-dimensional samples. More rigorous testing with other 
datasets will certainly increase the confidence about the correctness and efficiency 
of the pipeline. I am in the process of getting access to large cytomics data from 
Immune Tolerance Network (http://www.immunetolerance.org/), FlowRepository 
(http://flowrepository.org/), Cytobank (http://www.cytobank.org), and from 
other public repositories. I have tested the flowMatch pipeline with up to eight di¬ 
mensional samples. However, the newly developed mass cytometry technology can 
investigate more than 40 parameters of a single cell. flowMatch needs to be tested 
and enhanced (if needed) for this high dimensional data. 

Enhancing the modules of the flowMatch pipeline: The flowMatch pipeline is still 
under development. I plan to enhance different components of the pipeline with ad¬ 
ditional functionalities from which the users can choose depending on the objective 
of the data analysis. For variance stabilization, in addition to the inverse hyperbolic 
sine (asinh) function, several other transformations can be tested. It also remains a 
future work to design and implement a multi-dimensional variance stabilization algo¬ 
rithm. In current work, I am incorporating the proportion of cells into the population 
registration algorithm, which will lead to a between-sample distance metric similar 
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in spirit of the earth mover’s distance 133,139 . For the meta-clustering algorithm, I 
investigate the use of networks instead of trees to organize the templates, similar in 


spirit to the use of networks rather than trees in phylogenetics 160 . Finally, I am 


considering other classihcation algorithms, such as the logistic regression 178 , for 
complementing the template-based classihcation methods. 

Parallelizing algorithms for faster processing: At present, the modules of flow- 
Match are implemented as serial algorithms. However, given the ever increasing 
volume of data and the availability of high-performance computers, the algorithms 
of the pipeline can be parallelized to analyze a dataset faster without incurring any 
additional cost. I consider parallelizing flowMatch at different levels. The variance 
stabilization algorithm is embarrassingly parallel where density peaks on different 
channels can be identihed and their likelihood-ratio test can be evaluated on different 
processing units. A number of parallel clustering algorithms have been discussed in 


the literature 179 181 and they can be integrated into the/?owMafc/i pipeline. The 
quality of partitions can be evaluated in parallel for different choices of clustering 
parameters. The population registration problem can be solved faster by paralleliz¬ 
ing the mixed edge cover algorithm. I have already developed algorithms for parallel 


cardinality matching on bipartite graphs 47 49 . Developing parallel algorithms for 
weighted matching problem remains a future work. However, the number of pop¬ 
ulations is generally small (tens to hundreds) in typical FC samples, and therefore 
parallel population registration might not decrease the processing time signihcantly. 
The samples are processed in a sequential order by the meta-clustering algorithm, 
which is difficult to parallelize. However, we can relax the strict order of sample 
processing, and then parallelize the hierarchical algorithm. In the dynamic classihca- 
tion algorithm, we can insert multiple samples into the template tree simultaneously 
whenever the inserted samples update different parts of the template-tree. 

Applying the pipeline to problem outside of the flow cytometry: Stabilizing vari¬ 
ance, clustering, matching, creating templates are general concepts with applications 
to different areas. Therefore, the algorithms developed in this dissertation can be 













141 


applied - possibly with simple modifications- to problems outside of flow cytometry. 
I have already applied the variance stabilization framework to microarray data and 
compared the results with a state-of-the-art software developed for microarrays in 
Chapter Likewise, other algorithms have applications to problems from microar¬ 
rays, Chip-Seq, etc. 

FC data projected on a lower dimension has considerable similarities with im¬ 
ages from tradition photography and from bio-imaging technologies such as imaging 
cytometry, MRI, etc. Therefore, the algorithms in flowMatch have direct applica¬ 
tion in analyzing images from these technologies. For example, consider an image 
recognition application where the clustering algorithm segments images into different 
features (eyes, nose, etc.), the matching algorithm registers different features across 
images, and the classification algorithm recognizes the images with templates created 
from existing image library. This image recognition problem can be solved by the 
clustering, matching and classification algorithms from the flowMatch pipeline. 

I expect this dissertation to be a strong algorithmic contribution with application 
to flow cytometry, as well as to domains outside of flow cytometry. 
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